2

I am web-scraping for textual data that comes in a table such as the following one, and I would like to obtain as the results:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

    html = '''
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

I unwrapped the <span> element with beautifulsoup4 :

soup = BeautifulSoup(html)

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

However, I come up with either empty lines for all the empty <td> elements, or the line 'dolor sit amet' does not print, even though I can see it when I print the html with prettify.

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string) # line with <span> prints None

# missing line <span>
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

Am I doing something wrong? How could I use unwrap() and still access all of the text content without the empty lines?

Thanks for your help!

1 Answer 1

0

As I can test, you were near. Apply strip() and then use the re module to replace multiple spaces with only one, like:

from bs4 import BeautifulSoup
import re

html = ''' 
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

soup = BeautifulSoup(html)

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

print('\n'.join(
  re.sub(r'\s+', ' ', td.text.strip()) 
    for td in soup.find_all('td') if td.text.strip()))

It yields:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Sign up to request clarification or add additional context in comments.

1 Comment

Great, thanks! If I may ask, what is the difference between td.text.strip() and td.get_text().strip()? Why is text=re.compile(r'\w')not matching the line with "dolor sit amet"?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.