
I need to loop through all the entries on all the pages from this link, then click the 查看 ("view") link highlighted in red (please see the image below) to enter the detail page of each entry:

[screenshot: the list page, with the 查看 link to each entry's detail page highlighted in red]

The objective is to crawl the information from pages like the one in the image below, saving the left-hand labels as column names and the right-hand values as rows:

[screenshot: a detail page, with field names on the left and values on the right]

The code I used:

import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
# the list of entries is rendered in a <table class="gridview">
table = soup.find('table', {'class': 'gridview'})
df = pd.read_html(str(table))[0]
print(df.head(5))

Out:

   序号               工程名称  ...        发证日期 详细信息
0 NaN  假日万恒社区卫生服务站装饰装修工程  ...  2020-07-07   查看

The code for entering the detail pages:

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'

content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')

table = soup.find("table", attrs={"class":"detailview"}).findAll("tr")
 
for elements in table:
    # only the label cells are selected here, so only the field names get printed
    inner_elements = elements.findAll("td", attrs={"class":"label"})
    for text_for_elements in inner_elements:
        print(text_for_elements.text)

Out:

        工程名称:
        施工许可证号:
        所在区县:
        建设单位:
        工程规模(平方米):
        发证日期:
        建设地址:
        施工单位:
        监理单位:
        设计单位:
        行政相对人代码:
        法定代表人姓名:
        许可机关:

As you can see, I only get the column names; no entries have been successfully extracted.
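I suspect I also need to read the <td> that comes right after each label cell. A minimal sketch of what I mean (I have not verified that this pairs every field correctly):

import requests
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# each label <td> appears to be followed by a <td> holding its value
for label in soup.find_all('td', attrs={'class': 'label'}):
    value = label.find_next('td')
    print(label.get_text(strip=True), value.get_text(strip=True) if value else '')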

In order to loop over all the pages, I think we need to use POST requests, but I don't know which headers or form data to send.
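What I have in mind is something like the sketch below; the currentPage and pageSize field names are only a guess at what the page's pagination form submits, and no extra headers are set:

import requests
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'

# guessed pagination fields, sent as ordinary form data with no custom headers
payload = {'currentPage': 2, 'pageSize': 15}
soup = BeautifulSoup(requests.post(url, data=payload).content, 'lxml')

table = soup.find('table', {'class': 'gridview'})
print(table.get_text(strip=True)[:200] if table else 'no gridview table found')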

Thanks for your help in advance.

1 Answer


This script goes through all the pages, collects the data into a DataFrame, and saves it to data.csv.

(!!! Warning !!! There are 2405 pages in total, so it takes a long time to get them all):

import requests
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup


url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
payload = {'currentPage': 1, 'pageSize': 15}  # pagination form fields the list page expects

def scrape_page(url):
    # on a detail page, each label <td> is followed by a <td> holding its value,
    # so pair them up into a {label: value} dict
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return {td.get_text(strip=True).replace(':', ''): td.find_next('td').get_text(strip=True) for td in soup.select('td.label')}


all_data = []
current_page = 1

while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    # follow every 查看 ("view") link on the current page to its detail page
    for a in soup.select('a:contains("查看")'):
        u = 'http://bjjs.zjw.beijing.gov.cn' + a['href']
        d = scrape_page(u)
        all_data.append(d)
        pprint(d)

    # look for a clickable 下一页 ("next page") link; if there is none, this is the last page
    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next:
        break

    current_page += 1

df = pd.DataFrame(all_data)
df.to_csv('data.csv')

Prints the data to the screen and saves data.csv (screenshot from LibreOffice):

[screenshot: data.csv opened in LibreOffice]
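To load the saved file back later, something like this should work (index_col=0 because to_csv writes the row index by default):

import pandas as pd

df = pd.read_csv('data.csv', index_col=0)
print(df.head())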


2 Comments

Sorry, one more question: if we want to crawl only the first 100 pages, how should we change your code?
@ahbon Inside the while loop, check whether current_page is greater than 100 and, if it is, break.
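A minimal sketch of that change, reusing the names defined in the answer above (the 100 is just the cutoff from the comment):

current_page = 1
while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    for a in soup.select('a:contains("查看")'):
        all_data.append(scrape_page('http://bjjs.zjw.beijing.gov.cn' + a['href']))

    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next or current_page >= 100:  # stop after the first 100 pages
        break

    current_page += 1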
