Web scraping from multiple pages with for loop

Question

I have created a web scraping tool that collects data from house listings.

I have a problem with changing pages. I made a for loop that goes from 1 to some fixed number.

The problem is that the number of the last page changes all the time. Now it is 70, but tomorrow it could be 68 or 72. If I set the range to, for example, (1, 74), the last page gets printed many times, because when you request a page number above the maximum, the site always loads the last page.

URL: https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000 <---- if you put a number here that is higher than the real number of pages (70), the site opens the last page (70) for every out-of-range number in the loop.

So how can I make this loop stop when it reaches the last page?

for sivu in range(1, 100):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

Thanks

Ricco D · Accepted Answer · 2020-12-21 06:33:43Z

Using the site you gave, you can get the maximum range by scraping the button texts.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

# Collect the text of every pagination button
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
pages = [button.text for button in buttons]

print(pages)

Output: ['1', '68', '69', '70']

The last element is the last page. I got the buttons using class_="Pagination__button__3H2wX". You can take the last element of the list and use it as the limit of your loop. Note that this may break if the site's developers change the class name or markup of these buttons.
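The last step can be sketched as follows, using the button texts printed above as sample input. The helper name page_range is hypothetical, and taking the largest numeric label instead of pages[-1] is an added precaution (an assumption, not something the site is known to need) in case a non-numeric button such as a "next" arrow ever appears in the list.

```python
def page_range(button_texts):
    """Return the page numbers to visit, from 1 up to and including the last page."""
    # Keep only labels that are plain numbers; skip arrows, ellipses, etc.
    numeric = [int(t) for t in button_texts if t.strip().isdecimal()]
    if not numeric:
        # Assumption: no pagination buttons means a single page of results.
        return range(1, 2)
    # range() excludes its upper bound, so add 1 to include the last page.
    return range(1, max(numeric) + 1)


pages = ['1', '68', '69', '70']  # sample output from the snippet above
for sivu in page_range(pages):
    pass  # fetch my_url + str(sivu) here
```

Note the + 1: range(1, 70) would stop at page 69 and silently skip the last page.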

edited Dec 21, 2020 at 6:33

answered Dec 21, 2020 at 4:07


Joona Veteläinen · Accepted Answer · 2020-12-21 17:03:30Z

Here is my code now. For some reason I still cannot get it working. Any ideas?

Error:

Traceback (most recent call last):
  File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
    containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
  File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)


last_page = int(pages[-1])

for sivu in range(1, last_page):

    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä

        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta

        address_city = container.h4.text

        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite

        city_part = address_city.split(', ')[-2]  # kaupunginosa

        city = address_city.split(', ')[-1]  # kaupunki

        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi

        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")

f.close()
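
A likely cause of the AttributeError, going by the code above: the line soup = soup(page.content, 'html.parser') rebinds the name soup from the BeautifulSoup class to a parsed document. Calling a parsed document is Beautiful Soup's shortcut for find_all(), so soup(req.text, "html.parser") inside the loop returns a ResultSet instead of a new document, and calling .find() or .findAll() on that fails. Two more problems would surface once that is fixed: my_url already ends in sivu=1, so appending str(sivu) requests pages 11, 12, 13 and so on, and range(1, last_page) stops one page short. The following is a minimal sketch of a corrected skeleton, not a tested scraper. The names BASE_URL, page_url and scrape are hypothetical, and the class names are copied from the post and may no longer match the live site.

```python
# Base URL ends in "sivu=" with no page number, so a number can be appended.
BASE_URL = "https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu="


def page_url(page_number):
    """Build the URL of one result page."""
    return BASE_URL + str(page_number)


def scrape():
    # Third-party imports live inside the function so page_url() can be
    # used and tested without requests/bs4 installed.
    import requests
    from bs4 import BeautifulSoup  # keep the class under its own name

    # Use a different variable name for the parsed document, so the
    # BeautifulSoup class is still available inside the loop.
    first_page = BeautifulSoup(requests.get(page_url(1)).content, "html.parser")
    buttons = first_page.find_all("button", class_="Pagination__button__3H2wX")
    last_page = int(buttons[-1].text)

    for sivu in range(1, last_page + 1):  # + 1 so the last page is included
        req = requests.get(page_url(sivu))
        page_soup = BeautifulSoup(req.text, "html.parser")
        containers = page_soup.find_all("div", class_="ListPage__cardContainer__39dKQ")
        for container in containers:
            ...  # per-listing parsing and f.write() from the post go here
```

Calling scrape() runs the loop. The per-listing parsing from the post can be pasted into the inner loop unchanged.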
