
I need to loop through all the entries on all the pages from this link, then click the 查看 ("view") link highlighted in red (please see the image below) to enter the detail page of each entry:

[screenshot: the list page, with the 查看 link to each entry's detail page highlighted in red]

The objective is to crawl the information from pages like the one in the image below, saving the left-hand labels as column names and the right-hand values as rows:

[screenshot: a detail page, with field names on the left and values on the right]

The code I used:

import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
# the list of entries is rendered in a <table class="gridview">
table = soup.find('table', {'class': 'gridview'})
df = pd.read_html(str(table))[0]
print(df.head(5))

Out:

   序号               工程名称  ...        发证日期 详细信息
0 NaN  假日万恒社区卫生服务站装饰装修工程  ...  2020-07-07   查看

The code for entering the detail pages:

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'

content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')

table = soup.find("table", attrs={"class":"detailview"}).findAll("tr")
 
for elements in table:
    # only the label cells are selected here, so only the field names get printed
    inner_elements = elements.findAll("td", attrs={"class":"label"})
    for text_for_elements in inner_elements:
        print(text_for_elements.text)

Out:

        工程名称:
        施工许可证号:
        所在区县:
        建设单位:
        工程规模(平方米):
        发证日期:
        建设地址:
        施工单位:
        监理单位:
        设计单位:
        行政相对人代码:
        法定代表人姓名:
        许可机关:

As you can see, I only get the column names; no entries have been successfully extracted.
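I suspect I also need to read the <td> that comes right after each label cell. A minimal sketch of what I mean (I have not verified that this pairs every field correctly):

import requests
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# each label <td> appears to be followed by a <td> holding its value
for label in soup.find_all('td', attrs={'class': 'label'}):
    value = label.find_next('td')
    print(label.get_text(strip=True), value.get_text(strip=True) if value else '')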

In order to loop over all the pages, I think we need to use POST requests, but I don't know which headers or form data to send.
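What I have in mind is something like the sketch below; the currentPage and pageSize field names are only a guess at what the page's pagination form submits, and no extra headers are set:

import requests
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'

# guessed pagination fields, sent as ordinary form data with no custom headers
payload = {'currentPage': 2, 'pageSize': 15}
soup = BeautifulSoup(requests.post(url, data=payload).content, 'lxml')

table = soup.find('table', {'class': 'gridview'})
print(table.get_text(strip=True)[:200] if table else 'no gridview table found')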

Thanks for your help in advance.

1 Answer


This script goes through all the pages, collects the data into a DataFrame, and saves it to data.csv.

(!!! Warning !!! There are 2405 pages in total, so it takes a long time to get them all):

import requests
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup


url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
payload = {'currentPage': 1, 'pageSize': 15}  # pagination form fields the list page expects

def scrape_page(url):
    # on a detail page, each label <td> is followed by a <td> holding its value,
    # so pair them up into a {label: value} dict
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return {td.get_text(strip=True).replace(':', ''): td.find_next('td').get_text(strip=True) for td in soup.select('td.label')}


all_data = []
current_page = 1

while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    # follow every 查看 ("view") link on the current page to its detail page
    for a in soup.select('a:contains("查看")'):
        u = 'http://bjjs.zjw.beijing.gov.cn' + a['href']
        d = scrape_page(u)
        all_data.append(d)
        pprint(d)

    # look for a clickable 下一页 ("next page") link; if there is none, this is the last page
    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next:
        break

    current_page += 1

df = pd.DataFrame(all_data)
df.to_csv('data.csv')

Prints the data to the screen and saves data.csv (screenshot from LibreOffice):

[screenshot: data.csv opened in LibreOffice]
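To load the saved file back later, something like this should work (index_col=0 because to_csv writes the row index by default):

import pandas as pd

df = pd.read_csv('data.csv', index_col=0)
print(df.head())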


2 Comments

Sorry, one more question: if we want to crawl only the first 100 pages, how should we change your code?
@ahbon Inside the while loop, check whether current_page is greater than 100 and, if it is, break.
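A minimal sketch of that change, reusing the names defined in the answer above (the 100 is just the cutoff from the comment):

current_page = 1
while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    for a in soup.select('a:contains("查看")'):
        all_data.append(scrape_page('http://bjjs.zjw.beijing.gov.cn' + a['href']))

    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next or current_page >= 100:  # stop after the first 100 pages
        break

    current_page += 1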
