
I'm using Beautiful Soup to scrape Reddit. After scraping, I get encoding artifacts in the final text after finding the paragraph tags in the HTML and taking their text.

# find all paragraphs in the HTML and pull out the text of those without attributes
text = []
print(soup.original_encoding)
for paragraph in soup.find_all("p"):
    if not paragraph.attrs:
        text.append(paragraph.text)

and, for example, returns:

['\n      LastÂ\xa0month,Â\xa0DaveÂ\xa0startedÂ\xa0organizingÂ\xa0aÂ\xa0fundraiserÂ\xa0forÂ\xa0aÂ\xa0pediatricÂ\xa0cancerÂ\xa0hospital.Â\xa0HeÂ\xa0wasÂ\xa0promotingÂ\xa0itÂ\xa0heavilyÂ\xa0inÂ\xa0theÂ\xa0office,Â\xa0andÂ\xa0allÂ\xa0hisÂ\xa0emailsÂ\xa0andÂ\xa0conversationsÂ\xa0madeÂ\xa0itÂ\xa0seemÂ\xa0likeÂ\xa0thisÂ\xa0wasÂ\xa0beingÂ\xa0doneÂ\xa0onÂ\xa0behalfÂ\xa0ofÂ\xa0theÂ\xa0company.\n    ',...]

I've tried encoding and decoding it, but it just throws an error, and ignoring errors removes all the spaces. I can't find anything about this online, or at least I haven't been able to yet.
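(For context, here is a minimal reproduction of the "ignoring errors removes all the spaces" symptom, assuming the attempt was something like `.encode("ascii", errors="ignore")` — the exact call the asker used isn't shown, so this is a guess:)

```python
# 'Â\xa0' is the mojibake pair standing in for a non-breaking space
s = "LastÂ\xa0month"

# neither 'Â' (U+00C2) nor '\xa0' (U+00A0) is ASCII, so ignoring
# errors silently drops both -- and the word separator vanishes
cleaned = s.encode("ascii", errors="ignore").decode("ascii")
print(cleaned)  # 'Lastmonth'
```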

  • You should add more details such as the URL or the HTML of the page to reproduce the issue and help you out Commented May 22 at 1:34
  • What is the original encoding of the text? Commented May 22 at 2:21
  • The question has enough content, in code and data samples, to be answerable by people who understand text encoding workings and patterns. Voting to reopen. Commented May 22 at 14:36
  • Perhaps if the question included a small sample with a non-breaking space, it would be possible to reproduce. I'd suggest looking at stackoverflow.com/a/1462039/1766544 Commented May 22 at 16:24
  • I tried reproducing from '<html><p>Sample&nbsp;text.</p></html>' but it printed out just fine. There's not enough code to figure out how OP got to A-caret. Also OP has some code that "throws an error", but didn't post that code or the exception trace. (Also an import statement would help.) Commented May 22 at 16:37

1 Answer

These characters are how UTF-8-encoded no-break space characters (HTML &nbsp;, Unicode code point \xa0) show up in a byte stream when that stream is decoded as latin1.
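You can see the mechanism directly: a no-break space is two bytes in UTF-8, and latin1 maps each byte to one character, so the pair turns into "Â" followed by "\xa0":

```python
nbsp = "\xa0"                    # U+00A0, what &nbsp; decodes to
raw = nbsp.encode("utf-8")       # b'\xc2\xa0' -- the two bytes on the wire
mojibake = raw.decode("latin1")  # latin1 turns each byte into one character
print(repr(mojibake))            # 'Â\xa0'
```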

Your code doesn't show how you are retrieving the HTML and feeding it into BS4, but it is likely you are assuming the encoding to be "latin1" and possibly ignoring the HTTP headers telling you otherwise (the content is likely UTF-8).

The correct way to fix this is to let whatever tool you are using to fetch the web content do the text decoding itself - it should interpret any metadata declaring the text encoding and apply it for you.
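As a sketch of what that metadata handling looks like (simulating the body bytes and Content-Type header a server might send, since the fetching code isn't shown), the declared charset is the one to decode with, not latin1:

```python
from email.message import Message

# simulate a fetched HTTP response: UTF-8 bytes plus the header
body = "Last\xa0month".encode("utf-8")
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

# read the charset the server declared, falling back if absent
charset = headers.get_content_charset() or "utf-8"

good = body.decode(charset)    # 'Last\xa0month' -- correct
bad = body.decode("latin1")    # 'LastÂ\xa0month' -- the mojibake in question
print(repr(good), repr(bad))
```

Libraries like requests do this header inspection for you; alternatively, passing the raw bytes (not a pre-decoded string) to BeautifulSoup lets its own encoding detection run.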

However, without seeing that code, and given a string with that content, what you do to fix it is: first encode it back to bytes using "latin1", and then decode that byte string as UTF-8:

for paragraph in soup.find_all("p"):
    if not paragraph.attrs:
        text.append(paragraph.text.encode("latin1").decode("utf-8"))

But note that feeding the incorrectly decoded text into BeautifulSoup, which happens before this step, might yield an invalid HTML document, as control characters that were never intended to be there can show up in the byte sequence.
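Applied to a fragment of the sample output from the question, the round trip restores the real non-breaking spaces:

```python
# fragment of the scraped text, with the latin1 mojibake
scraped = "\n      LastÂ\xa0month,Â\xa0DaveÂ\xa0started"

# undo the wrong decode, then decode correctly
fixed = scraped.encode("latin1").decode("utf-8")
print(repr(fixed))  # '\n      Last\xa0month,\xa0Dave\xa0started'
```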
