
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not produce plain text exactly; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
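For reference, both complaints in the question (leaked JavaScript source and undecoded entities) can be handled with only the standard library. This is a minimal sketch, not a full solution for malformed HTML; it tracks nesting into `<script>`/`<style>` and relies on the parser's `convert_charrefs` to decode entities such as &#39;:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes &#39;, &amp;, etc. into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside skipped tags and not pure whitespace
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(source):
    parser = TextExtractor()
    parser.feed(source)
    return "\n".join(parser.chunks)

print(extract_text("<p>It&#39;s here</p><script>var x = 1;</script>"))
# → It's here
```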


Answer using Pandas to get table data from HTML.

If you want to extract table data quickly from HTML, you can use the read_html function (docs are here). Before using this function, you should read the gotchas/issues surrounding the BeautifulSoup4/html5lib/lxml HTML parsing libraries.

import pandas as pd

url = 'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
tables = pd.read_html(url)  # returns a list of DataFrames, one per table found
df = tables[0]
df


There are a number of options that can be played with; see here and here.
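As a sketch of two of those options: `match` filters to tables whose text matches a regex, and `header` promotes a row to column names. The table below is an invented example, not the IBM page above (newer pandas prefers a file-like object over a bare HTML string, hence the `StringIO` wrapper):

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Code</th><th>Meaning</th></tr>
  <tr><td>0</td><td>OK</td></tr>
  <tr><td>8</td><td>Error</td></tr>
</table>
"""

# match: only keep tables containing text matching the regex
# header=0: use the first row as the column names
tables = pd.read_html(io.StringIO(html), match="Meaning", header=0)
df = tables[0]
print(df)
```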


I achieve it with something like this:

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

Comments:

I am using Python 3.4 and this code is working fine for me.
text would have HTML tags in it
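The comment is right: `res.text` is still raw markup, not plain text. One common follow-up (a sketch, assuming BeautifulSoup is available) is to pass the response body through `get_text()`, which also decodes entities:

```python
from bs4 import BeautifulSoup

# stand-in for res.text from the snippet above, which still contains markup
page = '<html><body><h1>Headline</h1><p>Story text.</p></body></html>'
text = BeautifulSoup(page, "html.parser").get_text(separator=" ", strip=True)
print(text)
# → Headline Story text.
```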

The LibreOffice Writer comment has merit, since the application can run Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off rather than part of a larger production program, opening the HTML in Writer and saving the page as text would resolve the issues discussed here.
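For the one-off case, LibreOffice's headless mode can do the same conversion without opening the GUI at all. A sketch (assuming the `soffice` binary is on PATH; the filename is a placeholder):

```python
import shutil
import subprocess

def html_to_txt_cmd(path, outdir="."):
    # Batch-convert an HTML file to plain text via LibreOffice headless mode
    return ["soffice", "--headless", "--convert-to", "txt",
            "--outdir", outdir, path]

# Only attempt the conversion if LibreOffice is actually installed
if shutil.which("soffice"):
    subprocess.run(html_to_txt_cmd("page.html"), check=True)
```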


Perl way (sorry Mom, I'll never do it in production):

import re

def html2text(html):
    # Crude tag stripping -- breaks on comments, CDATA, entities, <script> bodies, etc.
    res = re.sub(r'<.*?>', ' ', html, flags=re.DOTALL)
    res = re.sub(r'\r+', '', res)            # drop carriage returns first so CRLF runs collapse
    res = re.sub(r'\n+', '\n', res)          # collapse blank lines
    res = re.sub(r'[\t ]+', ' ', res)        # collapse runs of tabs and spaces
    res = re.sub(r'(\n )+', '\n ', res)
    return res

Comments:

This is bad practice for so many reasons; for example, it leaves &nbsp; untouched.
Yes! It's true! Don't do it anywhere!
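The &nbsp; complaint at least is fixable from the standard library: decode entities with `html.unescape` after stripping the tags. Still a regex-based sketch with all the caveats above, not a robust parser:

```python
import html
import re

def strip_tags(source):
    # Strip tags first, then decode entities such as &nbsp; &amp; &#39;
    text = re.sub(r"<.*?>", " ", source, flags=re.DOTALL)
    text = html.unescape(text)
    # \s also matches the non-breaking space that &nbsp; decodes to
    return re.sub(r"\s+", " ", text).strip()

print(strip_tags("<p>Fish&nbsp;&amp;&nbsp;chips</p>"))
# → Fish & chips
```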

None of the methods here worked well with some websites: paragraphs generated by JS code resisted all of the above. Here is what eventually worked for me, inspired by this answer and this one.

The idea is to load the page in a webdriver and scroll to the end of the page so the JS can generate/load the rest of the content. Then send keystrokes to select all and copy/paste the whole page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pyperclip
import time

driver = webdriver.Chrome()
driver.get("https://www.lazada.com.ph/products/nike-womens-revolution-5-running-shoes-black-i1262506154-s4552606107.html?spm=a2o4l.seller.list.3.6f5d7b6cHO8G2Y&mp=1&freeshipping=1")

# Scroll to the bottom repeatedly until the page height stops growing,
# so lazy-loaded JavaScript content has a chance to render
len_of_page = driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")
while True:
    last_count = len_of_page
    time.sleep(1)
    len_of_page = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")
    if last_count == len_of_page:
        break

# Select all, copy, then read the clipboard
element = driver.find_element(By.TAG_NAME, 'body')
element.send_keys(Keys.CONTROL, 'a')
element.send_keys(Keys.CONTROL, 'c')
alltext = pyperclip.paste()
alltext = alltext.replace("\n", " ").replace("\r", " ")  # flatten line breaks
print(alltext)

It is slow, but nothing else worked.

UPDATE: A better method is to load the page source AFTER scrolling to the end of the page, using the inscriptis library:

from inscriptis import get_text
text = get_text(driver.page_source)

It still will not work with a headless driver (the page somehow detects that it is not really being displayed, and scrolling to the end will not make the JS code load its content), but at least we don't need the crazy copy/paste, which prevented us from running multiple scripts on a machine with a shared clipboard.


I like using pyquery to solve this:

from pyquery import PyQuery as pq


def html_to_text(html):
    """Return a list of the visible utf8 text for some HTML string."""

    if not html:
        return []

    if not isinstance(html, pq):
        html = pq(html)

    skip = ['style', 'title', 'noscript', 'head', 'meta']

    text = []

    try:
        if html.tag and html.tag.lower() in skip:
            return []
    except AttributeError:
        pass

    try:
        style = dict([y.strip() for y in x.strip().split(":")] for x in html.attr.style.split(";") if x.strip())
        if style["display"].lower() == "none":
            return []
    except (AttributeError, KeyError):
        pass

    for el in html:
        try:
            if not el.tag or el.tag.lower() in skip:
                continue
        except AttributeError:
            continue

        for child in el.getchildren():
            text.extend(html_to_text(child))

        if not el.text:
            continue

        text.append(el.text)

    return text


print(" ".join(html_to_text("<p>test</p>")))


I came here looking for answers about my own code, but I feel I can help here (I hope you give some feedback on my code; it's my first one):

import requests
from bs4 import BeautifulSoup

# creating a fake header to avoid a 403 Forbidden response
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = 'your-url'
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.text, 'html.parser')

# here you can get all the code from a div or a td:
code = pageSoup.find_all("your-tag(e.g. div)", attrs={'class': 'your-class'})

# In my code, I was trying to get the numbers from a table with 82 rows, so I wrote this:
i = 1
text = []
while i < 83:
    # I had to do some math to get the specific text I wanted at code[6*i-1] below:
    clean_text = code[6 * i - 1].text
    # solving the problem with the encoding:
    get = clean_text.replace(u'\xa0', '')
    get = get.replace(u'-', '')
    text.append(get)
    i += 1  # without this the loop never terminates

I believe that with this one you can get all the text you need, but it will be in a list of rows.
