Python Data Scraper

Question

I wrote the following code:

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    table = soup.find_all("table", class_="responsive airport-history-summary-table")
    tr = soup.find_all("tr")
    td = soup.find_all("td")
    print table
            

if __name__ == "__main__":
    main()

When I print the table, I get all the HTML tags (td, tr, span, etc.) as well. How can I print only the content of the table's rows and cells (tr, td), without the HTML?
Thanks!

Milano · Accepted Answer · 2016-04-04 20:16:42Z

2

Use the .getText() method (also spelled .get_text()) when you want the text content of an element. Since find_all returns a list of elements, you have to pick one of them first, for example td[0].getText().
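
As a minimal sketch of that point (written for Python 3, with a small made-up HTML string standing in for the downloaded page), indexing the result of find_all gives a single element whose text can then be extracted:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real site markup).
html = "<table><tr><td>Max <span>41</span></td><td>Min <span>28</span></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td")          # a list-like ResultSet of Tag objects
first_cell = cells[0]                # pick one Tag out of the list
first_text = first_cell.get_text()   # text only; the nested <span> tags are dropped
all_texts = [cell.get_text() for cell in cells]

print(first_text)
print(all_texts)
```

Calling get_text() on cells itself would fail, because the list has no such method; it has to be called on each element.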

Or you can do for example:

for tr in soup.find_all("tr"):
    print '>>>> NEW row <<<<'
    print '|'.join([x.getText() for x in tr.find_all('td')])

The loop above prints each row on its own line, with the text of its cells side by side, separated by |.

Note that your code finds every td and every tr in the whole page, but you probably want only those inside the table.

If you want to look for elements inside the table, you have to do this:

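table.find('tr') instead of soup.find('tr'), so that BeautifulSoup looks for tr elements inside the table instead of in the whole HTML. For this to work, table must be a single element (the result of soup.find), not the list returned by soup.find_all.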
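
A small sketch of the difference, in Python 3 and on a made-up two-table page (the markup below is an assumption for illustration, only the class names come from the question). It matches on one of the table's two classes, since class_ with a single name matches any element that carries that class:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second has the class we want.
html = """
<table class="other"><tr><td>ignore me</td></tr></table>
<table class="responsive airport-history-summary-table">
  <tr><td>Max Temperature</td><td>41</td></tr>
  <tr><td>Min Temperature</td><td>28</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None if nothing matches),
# so the result supports .find() and .find_all().
table = soup.find("table", class_="airport-history-summary-table")

rows_in_table = table.find_all("tr")   # searched inside this table only
rows_in_page = soup.find_all("tr")     # searched in the whole document

# Text of every cell, row by row, with surrounding whitespace stripped.
summary = [[td.get_text(strip=True) for td in tr.find_all("td")]
           for tr in rows_in_table]
print(summary)
```

Here rows_in_page also contains the row of the unrelated first table, while rows_in_table does not.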

YOUR CODE MODIFIED (according to your comment that there are more tables):

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        print '>>>>>>> NEW TABLE <<<<<<<<<'

        trs = table.find_all("tr")

        for tr in trs:
            # for each row of current table, write it using | between cells
            print '|'.join([x.get_text().replace('\n','') for x in tr.find_all('td')])



if __name__ == "__main__":
    main()
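
The script above is Python 2 (print statements, urllib.urlopen). For readers on Python 3, a rough equivalent might look like the sketch below. The function names fetch_html and tables_as_lines are made up for this example, and the demo runs on a small invented HTML string rather than the live page, whose current markup is not known here:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (Python 3 counterpart of urllib.urlopen)."""
    with urlopen(url) as response:
        return response.read()


def tables_as_lines(html):
    """Return one list per table; each item is a row rendered as 'cell|cell|...'."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        lines = []
        for tr in table.find_all("tr"):
            # Same logic as the answer: cell text with newlines removed, joined by |
            cells = [td.get_text().replace("\n", "") for td in tr.find_all("td")]
            lines.append("|".join(cells))
        result.append(lines)
    return result


# Offline demo on a hypothetical snippet; no network access is needed.
sample = "<table><tr><td>Max</td><td>41</td></tr><tr><td>Min</td><td>28</td></tr></table>"
for lines in tables_as_lines(sample):
    print(">>>>>>> NEW TABLE <<<<<<<<<")
    for line in lines:
        print(line)
```

To run it against a real page, pass the downloaded bytes in, as in tables_as_lines(fetch_html(url)) with the URL from the question. Keeping the download separate from the parsing makes the parsing part easy to try on saved or hand-written HTML.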

edited Apr 4, 2016 at 20:16

answered Apr 4, 2016 at 19:33

Milano



5 Comments

malina Over a year ago

Indeed, you are right about the tds in the table, but if I try tr = table.find('tr') I get the following error: AttributeError: 'ResultSet' object has no attribute 'find'

Milano Over a year ago

Because find_all returns a list (a ResultSet), not a single element. If you have just one table in the HTML, you have to use soup.find('table'...) instead of soup.find_all('table'...).

malina Over a year ago

But the complete HTML contains more tables, and I am narrowing my search by specifying a class. I am not quite sure what you mean...

Milano Over a year ago

@malina Check my answer; I've edited your code there.

Milano Over a year ago

Now it's for all tables in the web page.
