I am scraping text from an HTML table like the one below, and I would like to obtain the following output:
Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
html = '''
<table>
<tr class="title last ">
<td>
Lorem ipsum
</td>
<td>
</td>
</tr>
<tr>
<td>
<span class="caps">dolor
</span>
sit amet
</td>
<td>
</td>
</tr>
<tr>
<td>
consectetur adipiscing elit,
</td>
<td>
</td>
</tr>
<tr>
<td>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</td>
<td>
</td>
</tr>
<tr>
<td>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</td>
<td>
</td>
</tr>
</table>
'''
I unwrapped the <span> element with beautifulsoup4:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()
However, I end up with either blank lines for all the empty <td> elements, or the line 'dolor sit amet' is not printed at all, even though I can see it when I print the HTML with prettify().
import re

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string)  # the <td> that contained the <span> prints None

# the <td> that contained the <span> is missing
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())
Am I doing something wrong? How could I use unwrap() and still access all of the text content without the empty lines?
Thanks for your help!
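A possible direction, offered as a minimal sketch rather than a definitive answer: after unwrap(), the affected <td> holds several adjacent text nodes instead of one, so .string is None for that cell, and the text= filter (which matches against .string) skips it. The sketch below avoids .string entirely: it reads each cell with get_text(), collapses runs of whitespace, and drops cells that end up empty. The shortened html sample and the names lines, td and text are illustrative assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened version of the table above (two rows only, for illustration).
html = """<table>
<tr><td>
<span class="caps">dolor
</span>
sit amet
</td><td>
</td></tr>
<tr><td>
consectetur adipiscing elit,
</td><td>
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Remove the <span> tags but keep their content.
for span in soup.find_all("span"):
    span.unwrap()

lines = []
for td in soup.find_all("td"):
    # get_text() concatenates every text node in the cell;
    # split() + join() collapses newlines and repeated spaces.
    text = " ".join(td.get_text().split())
    if text:  # skip cells that contain only whitespace
        lines.append(text)

print("\n".join(lines))
```

With this approach unwrap() is optional, since get_text() already descends into nested tags. If .string is needed, calling soup.smooth() after unwrapping (available in Beautiful Soup 4.8.0 and later) should merge the adjacent text nodes back into one.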