
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not produce plain text exactly; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
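For reference, both complaints in the question (leaked JavaScript source and undecoded entities) can be handled with only the standard library. This is a minimal sketch, not a full solution for malformed HTML; it tracks nesting into `<script>`/`<style>` and relies on the parser's `convert_charrefs` to decode entities such as &#39;:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes &#39;, &amp;, etc. into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside skipped tags and not pure whitespace
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(source):
    parser = TextExtractor()
    parser.feed(source)
    return "\n".join(parser.chunks)

print(extract_text("<p>It&#39;s here</p><script>var x = 1;</script>"))
# → It's here
```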


Answer using Pandas to get table data from HTML.

If you want to extract table data quickly from HTML, you can use the read_html function (docs are here). Before using this function, you should read the gotchas/issues surrounding the BeautifulSoup4/html5lib/lxml HTML parsing libraries.

import pandas as pd

url = 'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
tables = pd.read_html(url)  # returns a list of DataFrames, one per table found
df = tables[0]
df


There are a number of options that can be played with; see here and here.
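As a sketch of two of those options: `match` filters to tables whose text matches a regex, and `header` promotes a row to column names. The table below is an invented example, not the IBM page above (newer pandas prefers a file-like object over a bare HTML string, hence the `StringIO` wrapper):

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Code</th><th>Meaning</th></tr>
  <tr><td>0</td><td>OK</td></tr>
  <tr><td>8</td><td>Error</td></tr>
</table>
"""

# match: only keep tables containing text matching the regex
# header=0: use the first row as the column names
tables = pd.read_html(io.StringIO(html), match="Meaning", header=0)
df = tables[0]
print(df)
```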


I achieve it with something like this:

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

Comments:

I am using Python 3.4 and this code is working fine for me.
text would have HTML tags in it
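The comment is right: `res.text` is still raw markup, not plain text. One common follow-up (a sketch, assuming BeautifulSoup is available) is to pass the response body through `get_text()`, which also decodes entities:

```python
from bs4 import BeautifulSoup

# stand-in for res.text from the snippet above, which still contains markup
page = '<html><body><h1>Headline</h1><p>Story text.</p></body></html>'
text = BeautifulSoup(page, "html.parser").get_text(separator=" ", strip=True)
print(text)
# → Headline Story text.
```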

The LibreOffice Writer comment has merit, since the application can run Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off rather than part of a larger production program, opening the HTML in Writer and saving the page as text would resolve the issues discussed here.
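For the one-off case, LibreOffice's headless mode can do the same conversion without opening the GUI at all. A sketch (assuming the `soffice` binary is on PATH; the filename is a placeholder):

```python
import shutil
import subprocess

def html_to_txt_cmd(path, outdir="."):
    # Batch-convert an HTML file to plain text via LibreOffice headless mode
    return ["soffice", "--headless", "--convert-to", "txt",
            "--outdir", outdir, path]

# Only attempt the conversion if LibreOffice is actually installed
if shutil.which("soffice"):
    subprocess.run(html_to_txt_cmd("page.html"), check=True)
```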


Perl way (sorry Mom, I'll never do it in production):

import re

def html2text(html):
    # Crude tag stripping -- breaks on comments, CDATA, entities, <script> bodies, etc.
    res = re.sub(r'<.*?>', ' ', html, flags=re.DOTALL)
    res = re.sub(r'\r+', '', res)            # drop carriage returns first so CRLF runs collapse
    res = re.sub(r'\n+', '\n', res)          # collapse blank lines
    res = re.sub(r'[\t ]+', ' ', res)        # collapse runs of tabs and spaces
    res = re.sub(r'(\n )+', '\n ', res)
    return res

Comments:

This is bad practice for so many reasons; for example, it leaves &nbsp; untouched.
Yes! It's true! Don't do it anywhere!
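The &nbsp; complaint at least is fixable from the standard library: decode entities with `html.unescape` after stripping the tags. Still a regex-based sketch with all the caveats above, not a robust parser:

```python
import html
import re

def strip_tags(source):
    # Strip tags first, then decode entities such as &nbsp; &amp; &#39;
    text = re.sub(r"<.*?>", " ", source, flags=re.DOTALL)
    text = html.unescape(text)
    # \s also matches the non-breaking space that &nbsp; decodes to
    return re.sub(r"\s+", " ", text).strip()

print(strip_tags("<p>Fish&nbsp;&amp;&nbsp;chips</p>"))
# → Fish & chips
```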

None of the methods here worked well with some websites: paragraphs generated by JS code resisted all of the above. Here is what eventually worked for me, inspired by this answer and this one.

The idea is to load the page in a webdriver and scroll to the end of the page so the JS can generate/load the rest of the content. Then send keystrokes to select all and copy/paste the whole page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pyperclip
import time

driver = webdriver.Chrome()
driver.get("https://www.lazada.com.ph/products/nike-womens-revolution-5-running-shoes-black-i1262506154-s4552606107.html?spm=a2o4l.seller.list.3.6f5d7b6cHO8G2Y&mp=1&freeshipping=1")

# Scroll to the bottom repeatedly until the page height stops growing,
# so lazy-loaded JavaScript content has a chance to render
len_of_page = driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")
while True:
    last_count = len_of_page
    time.sleep(1)
    len_of_page = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")
    if last_count == len_of_page:
        break

# Select all, copy, then read the clipboard
element = driver.find_element(By.TAG_NAME, 'body')
element.send_keys(Keys.CONTROL, 'a')
element.send_keys(Keys.CONTROL, 'c')
alltext = pyperclip.paste()
alltext = alltext.replace("\n", " ").replace("\r", " ")  # flatten line breaks
print(alltext)

It is slow, but nothing else worked.

UPDATE: A better method is to load the page source AFTER scrolling to the end of the page, using the inscriptis library:

from inscriptis import get_text
text = get_text(driver.page_source)

It still will not work with a headless driver (the page somehow detects that it is not really being displayed, and scrolling to the end will not make the JS code load its content), but at least we don't need the crazy copy/paste, which prevented us from running multiple scripts on a machine with a shared clipboard.


I like using pyquery to solve this:

from pyquery import PyQuery as pq


def html_to_text(html):
    """Return a list of the visible utf8 text for some HTML string."""

    if not html:
        return []

    if not isinstance(html, pq):
        html = pq(html)

    skip = ['style', 'title', 'noscript', 'head', 'meta']

    text = []

    try:
        if html.tag and html.tag.lower() in skip:
            return []
    except AttributeError:
        pass

    try:
        style = dict([y.strip() for y in x.strip().split(":")] for x in html.attr.style.split(";") if x.strip())
        if style["display"].lower() == "none":
            return []
    except (AttributeError, KeyError):
        pass

    for el in html:
        try:
            if not el.tag or el.tag.lower() in skip:
                continue
        except AttributeError:
            continue

        for child in el.getchildren():
            text.extend(html_to_text(child))

        if not el.text:
            continue

        text.append(el.text)

    return text


print(" ".join(html_to_text("<p>test</p>")))


I came here looking for answers about my own code, but I feel I can help here (I hope you give some feedback on my code; it's my first one):

import requests
from bs4 import BeautifulSoup

# creating a fake header to avoid a 403 Forbidden response
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = 'your-url'
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.text, 'html.parser')

# here you can get all the code from a div or a td:
code = pageSoup.find_all("your-tag(e.g. div)", attrs={'class': 'your-class'})

# In my code, I was trying to get the numbers from a table with 82 rows, so I wrote this:
i = 1
text = []
while i < 83:
    # I had to do some math to get the specific text I wanted at code[6*i-1] below:
    clean_text = code[6 * i - 1].text
    # solving the problem with the encoding:
    get = clean_text.replace(u'\xa0', '')
    get = get.replace(u'-', '')
    text.append(get)
    i += 1  # without this the loop never terminates

I believe that with this one you can get all the text you need, but it will be in a list of rows.
