Python: print binary as string

Question

I have a crawler that parses the HTML of a given site and prints parts of the source code. Here is my script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re

class Crawler:

    headers = {'User-Agent' : 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)    
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()

My code works fine, but it prints the source with an annoying b' before the value. I already tried to use decode('utf-8') but it didn't work. Any ideas?

UPDATE

If I don't use the encode('utf-8') I have the following error:

Traceback (most recent call last):
  File "crawler.py", line 25, in <module>
    Crawler()
  File "crawler.py", line 23, in __init__
    print(extra)
  File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined>

Without this it returns the following error: Traceback (most recent call last): File "crawler.py", line 25, in <module> Crawler() File "crawler.py", line 23, in __init__ print(extra) File "c:\Python34\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined>
– bodruk, Commented Nov 1, 2015 at 3:07
bytes doesn't have an encode method in Python 3, so you are starting with a string and converting it to a byte string.
– chucksmash, Commented Nov 1, 2015 at 3:07
@chucksmash Using encode('utf-8') I've stopped the error above. But I've created a new one: UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined>.
– bodruk, Commented Nov 1, 2015 at 3:09
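
The point about strings and byte strings in the comments above can be illustrated with a short sketch. In Python 3, str.encode() returns a bytes object, and print() shows the repr of a bytes object, which is where the b' prefix comes from. The sample text is made up for illustration; it is not taken from the crawled page.

```python
# Sample text chosen for illustration; it contains U+2030, the character
# named in the traceback.
text = "arroz 5\u2030"

encoded = text.encode("utf-8")      # str -> bytes
print(type(encoded).__name__)       # bytes
print(repr(encoded))                # the repr starts with b'

decoded = encoded.decode("utf-8")   # bytes -> str, the inverse step
print(decoded == text)              # True
```

So returning soup.encode('utf-8') hands print() a bytes object, while returning the soup (or a str built from it) hands it text.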

Community · Accepted Answer · 2017-05-23 12:24:00Z

When I run your code as given except replacing return soup.encode('utf-8') with return soup, it works fine. My environment:

OS: Ubuntu 15.10
Python: 3.4.3
python3 dist-packages bs4 version: beautifulsoup4==4.3.2

This leads me to suspect that the problem lies with your environment, not your code. Your stack trace mentions cp850.py and your code is hitting a .com.br site, which makes me think that the default encoding of your shell can't handle some Unicode characters. Here's the Wikipedia page for cp850: Code Page 850.

You can check the default encoding your terminal is using with:

>>> import sys
>>> sys.stdout.encoding

My terminal responds with:

'UTF-8'

I'm assuming yours won't and that this is the root of the issue you are running into.
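
To test whether a particular encoding can represent a character without involving print at all, a small check along these lines might help. The helper name can_encode is hypothetical and not part of any library; the fallback to UTF-8 when no encoding is reported is an assumption made for the sketch.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

per_mille = "\u2030"  # the character named in the traceback

print(can_encode(per_mille, "utf-8"))   # True
print(can_encode(per_mille, "cp850"))   # False

# sys.stdout.encoding can be None when stdout is not a regular text stream,
# so fall back to UTF-8 here (an assumption for this sketch).
stdout_encoding = sys.stdout.encoding or "utf-8"
print(stdout_encoding, can_encode(per_mille, stdout_encoding))
```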

EDIT:

In fact, I can exactly replicate your error with:

>>> print("\u2030".encode("cp850"))

So that's the issue: because of your computer's locale settings, print is implicitly encoding the text to your terminal's default encoding (cp850), which has no mapping for '\u2030', and that raises the UnicodeEncodeError.
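
One possible workaround, sketched here under assumptions and not tested on a cp850 console, is to encode the text for the terminal yourself and replace the characters it cannot represent, so that print never raises UnicodeEncodeError. The helper names to_printable and safe_print are hypothetical.

```python
import sys

def to_printable(text, encoding):
    """Round-trip text through encoding, replacing unencodable characters with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def safe_print(text):
    """Print text without raising UnicodeEncodeError on a limited terminal encoding."""
    # Fall back to ASCII if the stream reports no encoding (assumption).
    encoding = sys.stdout.encoding or "ascii"
    print(to_printable(text, encoding))

print(to_printable("5\u2030 arroz", "cp850"))  # 5? arroz
safe_print("5\u2030 arroz")
```

This is lossy: the unsupported character is replaced with a question mark instead of being displayed. It only avoids the exception and does not make the terminal able to show the character.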

Updating Windows to display Unicode characters from the command prompt is a bit outside my wheelhouse, so I can't offer any advice other than to direct you to a relevant question/answer.

edited May 23, 2017 at 12:24

CommunityBot

answered Nov 1, 2015 at 4:26

chucksmash

2 Comments

bodruk Over a year ago

Yeah! It returned cp850. What should I do?

bodruk Over a year ago

I've tried the link solution, without success... thanks tho.
