Scrape Table HTML with BeautifulSoup

Question

I'm trying to scrape a website that is built entirely with tables. Here is a link to an example page: http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false

My goal is to get the first and last name: Lass Christian (screenshot below).

Screenshot: https://i.sstatic.net/q3nMb.png

I've already scraped many websites, but with this one I have absolutely no idea how to proceed. There are only tables without any id or class attributes, and I can't figure out where I'm supposed to start.

Here's an example of the HTML code:

<table border="1" cellpadding="1" cellspacing="0" width="100%">
            <tbody><tr bgcolor="#f0eef2">
                
                <th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th>
            </tr>
            <tr bgcolor="#f0eef2">
                
                <th>
                    <a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='N';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
                        Nom et Prénoms, Origine, Domicile, Part sociale
                    </a>
                    
                </th>
                <th>
                    <a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='F';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
                        Fonctions
                    </a>
                    
                        <img src="/registres/hrcintapp-pub/img/down_r.png" align="bottom" border="0" alt="">
                    
                </th>
                <th>Mode Signature</th>
            </tr>
            
                <tr bgcolor="#ffffff">
                    
                    
                    <td>
                        <span style="text-decoration: none;">
                            Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                        </span>
                    </td>
                    <td><span style="text-decoration: none;">associé gérant </span>&nbsp;</td>
                    
                    
                        <td><span style="text-decoration: none;">signature individuelle</span>&nbsp;</td>                   
                    
                    
                </tr>
            
            
            
            
        </tbody></table>

Yes, I'd like to get Lass Christian, but not all the pages on the website are the same; sometimes there are more tables. So I want to find a way to get the name on every kind of page.
– jjyoh, Commented Jul 22, 2016 at 20:18

Padraic Cunningham · Accepted Answer · 2016-07-22 20:28:16Z

2

This will get the name from the page. The table comes right after the anchor with the id adm; once you have that table, there are numerous ways to get what you need:

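from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false')

soup = BeautifulSoup(r.content, "lxml")
# The table of interest is the first <table> after the element with id "adm".
table = soup.select_one("#adm").find_next("table")
# The attribute value is quoted because it contains a colon.
# The name is everything before the first comma in the first matching span.
name = table.select_one('td span[style^="text-decoration:"]').text.split(",", 1)[0].strip()
print(name)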

Output:

Lass Christian

Or:

table = soup.select_one("#adm").find_next("table")
name = table.find("tr",bgcolor="#ffffff").td.span.text.split(",", 1)[0].strip()
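To try either variant without hitting the live site, the same lookup can be run against an inline snippet. The following is only a minimal sketch. It assumes, as described above, that an element with the id adm sits just before the table. The SAMPLE_HTML string and the extract_names helper are made up for illustration, and html.parser is swapped in for lxml purely to avoid an extra dependency. It also loops over every white row rather than taking only the first one, in case a page lists several people.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure shown in the question.
SAMPLE_HTML = """
<a id="adm"></a>
<table border="1">
  <tbody>
    <tr bgcolor="#f0eef2"><th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th></tr>
    <tr bgcolor="#f0eef2"><th>Nom et Prénoms, Origine, Domicile, Part sociale</th><th>Fonctions</th><th>Mode Signature</th></tr>
    <tr bgcolor="#ffffff">
      <td><span style="text-decoration: none;">
        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
      </span></td>
      <td><span style="text-decoration: none;">associé gérant </span>&nbsp;</td>
      <td><span style="text-decoration: none;">signature individuelle</span>&nbsp;</td>
    </tr>
  </tbody>
</table>
"""


def extract_names(html):
    """Return the name from the first cell of each data row after #adm."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one("#adm")
    if anchor is None:
        return []  # page layout differs: no element with id "adm"
    table = anchor.find_next("table")
    if table is None:
        return []  # no table follows the anchor
    names = []
    for row in table.find_all("tr", bgcolor="#ffffff"):
        cell = row.find("td")
        if cell is None:
            continue  # row without a data cell
        text = cell.get_text(strip=True)
        if text:
            # The name is everything before the first comma.
            names.append(text.split(",", 1)[0].strip())
    return names


print(extract_names(SAMPLE_HTML))
```

Returning an empty list when the anchor or table is missing is a design choice for pages that do not follow this layout; raising an exception would work just as well if you prefer to be told about such pages.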

edited Jul 22, 2016 at 20:28

answered Jul 22, 2016 at 20:21

Padraic Cunningham


1 Comment

John Over a year ago

Appreciate the nice answer ... increased my learning as well!

John · Answer · 2016-07-22 19:49:51Z

0

Something like this?

results = soup.find_all("tr", {"bgcolor" : "#ffffff"})
for result in results:
    the_name = result.td.span.get_text().split(',')[0]

answered Jul 22, 2016 at 19:49

John


2 Comments

jjyoh Over a year ago

Good idea! But I don't understand the second part: result.td.span.get_text().split(' , ')[0]. It raises AttributeError: 'NoneType' object has no attribute 'get_text'. What do you think?

John Over a year ago

The idea is to look in the td element, then in the included span element. What that error means is that this tree wasn't found for one of the tr elements. Maybe add a print statement in there to see if you're finding any of them. Sorry I'm not in a place right now where I can test it, but I will be later.
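The diagnosis above can be checked with a guarded version of the loop. This is a sketch only: the SAMPLE_HTML string is invented so that it contains one row without a span and one row without a td, and names_from_rows is a hypothetical helper name. Rows where the td or span is missing are skipped instead of raising AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: only the first one has the td > span structure.
SAMPLE_HTML = """
<table>
  <tr bgcolor="#ffffff"><td><span>Lass Christian, du Danemark</span></td></tr>
  <tr bgcolor="#ffffff"><td>no span in this cell</td></tr>
  <tr bgcolor="#ffffff"><th>no td in this row</th></tr>
</table>
"""


def names_from_rows(html):
    """Collect names from white rows, skipping rows that lack td > span."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr", {"bgcolor": "#ffffff"}):
        td = row.td  # None when the row has no <td>
        span = td.span if td is not None else None  # None when no <span>
        if span is None:
            continue  # this is the case that raised AttributeError
        names.append(span.get_text().split(",")[0].strip())
    return names


print(names_from_rows(SAMPLE_HTML))
```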
