Unwrap element with beautifulsoup4: does it affect the .string of parent element?

Question

I am scraping text from an HTML table like the one shown further down, and I would like to get the following output:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

html = '''
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

I unwrapped the <span> element with beautifulsoup4:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

However, I end up with either an empty line for every empty <td> element, or the line 'dolor sit amet' is not printed at all, even though I can see it when I print the HTML with prettify().

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string) # line with <span> prints None

# the line that contained the <span> is missing
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

Am I doing something wrong? How could I use unwrap() and still access all of the text content without the empty lines?

Thanks for your help!

Birei · Accepted Answer · 2015-02-15 20:38:25Z


From my testing, you were close. Call strip() on each cell's text, skip the cells that come back empty, and then use the re module to collapse each run of whitespace into a single space, like:

from bs4 import BeautifulSoup
import re

html = ''' 
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

# skip empty cells, then collapse each run of whitespace into one space
print('\n'.join(
    re.sub(r'\s+', ' ', td.text.strip())
    for td in soup.find_all('td') if td.text.strip()))

It yields:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
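
As a possible alternative, here is a minimal sketch that leans on the separator and strip arguments of get_text() instead of a regular expression. It is not part of the code above: the shortened table is only an illustrative sample, and the explicit 'html.parser' argument is an assumption. With strip=True each text fragment is stripped and empty fragments are dropped, so the <span> does not even need to be unwrapped first. Note that whitespace inside a single fragment is left as it is, which is enough for this table but may not be for others.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustrative sample only).
html = '''
<table>
 <tr>
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

lines = []
for td in soup.find_all('td'):
    # Strip every text fragment, drop the empty ones, join the rest with a space.
    text = td.get_text(' ', strip=True)
    if text:  # skip cells that contain only whitespace
        lines.append(text)

print('\n'.join(lines))
```

For the two non-empty cells in this sample, the loop should collect 'Lorem ipsum' and 'dolor sit amet'.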


1 Comment

Ely Over a year ago

Great, thanks! If I may ask, what is the difference between td.text.strip() and td.get_text().strip()? Why is text=re.compile(r'\w') not matching the line with "dolor sit amet"?
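
The two questions in this comment can be checked with a short experiment. The sketch below rests on a few assumptions: it uses a one-cell sample instead of the full table, it passes 'html.parser' explicitly, it uses string= (the newer name for the text= argument), and it calls smooth(), which as far as I know needs bs4 4.8.0 or later. A tag's .string is only set when the tag has exactly one child. The cell with the <span> has three children both before and after unwrap(), so .string stays None there and a text=/string= filter on the <td> has nothing to match. As for .text, it is a property that returns the same value as get_text().

```python
import re
from bs4 import BeautifulSoup

# One-cell sample modelled on the row from the question (illustrative only).
html = '<table><tr><td>\n <span class="caps">dolor\n </span>\n sit amet\n</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

soup.span.unwrap()

# The cell now holds several adjacent text nodes rather than a single child,
# so .string is None and the string= filter skips this cell.
string_after_unwrap = td.string
matches_after_unwrap = soup.find_all('td', string=re.compile(r'\w'))

# .text is a property that returns the same value as get_text().
same_text = td.text == td.get_text()

# smooth() (assumed available, bs4 4.8.0+) merges adjacent text nodes,
# which leaves the cell with one child and therefore a usable .string.
soup.smooth()
matches_after_smooth = soup.find_all('td', string=re.compile(r'\w'))

print(string_after_unwrap)
print(len(matches_after_unwrap), len(matches_after_smooth))
```

If merging the text nodes is not wanted, filtering on get_text() as in the answer avoids the dependence on .string altogether.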
