I'm using Beautiful Soup to scrape Reddit. After finding the paragraph tags in the HTML and taking their text, I end up with encoding artifacts in the final text.
# Find all paragraphs in the HTML and collect the text of those that have no attributes
text = []
print(soup.original_encoding)
for paragraph in soup.find_all("p"):
    if not paragraph.attrs:
        text.append(paragraph.text)
For example, this returns:
['\n LastÂ\xa0month,Â\xa0DaveÂ\xa0startedÂ\xa0organizingÂ\xa0aÂ\xa0fundraiserÂ\xa0forÂ\xa0aÂ\xa0pediatricÂ\xa0cancerÂ\xa0hospital.Â\xa0HeÂ\xa0wasÂ\xa0promotingÂ\xa0itÂ\xa0heavilyÂ\xa0inÂ\xa0theÂ\xa0office,Â\xa0andÂ\xa0allÂ\xa0hisÂ\xa0emailsÂ\xa0andÂ\xa0conversationsÂ\xa0madeÂ\xa0itÂ\xa0seemÂ\xa0likeÂ\xa0thisÂ\xa0wasÂ\xa0beingÂ\xa0doneÂ\xa0onÂ\xa0behalfÂ\xa0ofÂ\xa0theÂ\xa0company.\n ',...]
I've tried encoding and decoding the text, but that just throws an error, and ignoring the errors removes all the spaces. I haven't been able to find anything online about this so far.
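For reference, the "Â\xa0" pattern in the output above is what a non-breaking space (U+00A0) looks like when its UTF-8 bytes (0xC2 0xA0) are decoded as Latin-1. The following is only a minimal sketch of that mechanism and one way to undo it; it assumes the page really was UTF-8 and was mis-decoded as Latin-1 somewhere upstream, which the code above does not confirm. The sample string and the `repair` helper are hypothetical, made up for illustration.

```python
# Assumption: the page is UTF-8 but was decoded as Latin-1 at some point.
original = "Last\xa0month,\xa0Dave"        # text containing non-breaking spaces (U+00A0)
raw_bytes = original.encode("utf-8")        # each U+00A0 becomes the two bytes 0xC2 0xA0
garbled = raw_bytes.decode("latin-1")       # 0xC2 turns into "Â", 0xA0 stays a non-breaking space


def repair(text: str) -> str:
    """Hypothetical helper: reverse a UTF-8-read-as-Latin-1 mix-up.

    Returns the text unchanged if it cannot be round-tripped.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = repair(garbled)                  # back to the original string
cleaned = repaired.replace("\xa0", " ")     # optionally swap non-breaking spaces for plain spaces
print(repr(garbled))
print(repr(cleaned))
```

If this is the cause, it may be cleaner to avoid the wrong decode in the first place, for example by giving Beautiful Soup the raw bytes together with `from_encoding="utf-8"` rather than an already-decoded string. Whether that applies depends on how `soup` was built, which is not shown here.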