
I have created a web scraping tool that collects data from house listings.

I have a problem with changing pages. I wrote a for loop that goes from 1 to some number.

The problem is this: on this site the number of the last page changes all the time. Now it is 70, but tomorrow it could be 68 or 72. If I set the range to, for example, (1, 74), it will print the last page many times, because when you go past the maximum the site always loads the last page.

URL: https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000 <---- if you set this above the real number of pages (70), it automatically opens the last page (70), once for every extra iteration of the range.

So how do I make this loop stop when it reaches the maximum page number?

for sivu in range(1, 100):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

Thanks

2 Answers


Using the site you gave, you can get the maximum range by scraping the button texts.

import requests
from bs4 import BeautifulSoup as bs

url='https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page=requests.get(url)
soup = bs(page.content,'html.parser')

last_page = None
pages = []

buttons=soup.find_all('button', class_= "Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)

print(pages)

Output: ['1', '68', '69', '70']

The last element is the number of the last page. I was able to get the buttons using class_= "Pagination__button__3H2wX". You can take the last element of the list, convert it to an integer, and use it as the limit of your loop. Note, however, that this class name may change whenever the site's developers update the pagination buttons, which would break the selector.
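The button texts are strings, so they need converting before they can bound a loop, and range() excludes its stop value. A minimal sketch of that step, assuming the labels look like the output above; `last_page_from_labels` is a hypothetical helper name, not part of any library. It takes the largest numeric label rather than the last element, so that a non-numeric button (an arrow, for example) would not break it.

```python
def last_page_from_labels(labels):
    """Return the largest page number among pagination button labels."""
    # Keep only labels that are plain digits; skip arrows, ellipses, etc.
    numbers = [int(text) for text in labels if text.strip().isdigit()]
    if not numbers:
        raise ValueError("no numeric pagination labels found")
    return max(numbers)


pages = ['1', '68', '69', '70']  # the output shown above
last_page = last_page_from_labels(pages)

# range() excludes its stop value, so add 1 to include the last page.
page_numbers = list(range(1, last_page + 1))
print(page_numbers[0], page_numbers[-1])
```

In the scraper, the loop would then be `for sivu in range(1, last_page + 1):` instead of a hard-coded upper limit.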


So here is my code now. For some reason I still cannot get it working. Any ideas?

Error:

Traceback (most recent call last):
  File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
    containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
  File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)


last_page = int(pages[-1])

for sivu in range(1, last_page):

    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})

    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä

        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta

        address_city = container.h4.text

        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite

        city_part = address_city.split(', ')[-2]  # kaupunginosa

        city = address_city.split(', ')[-1]  # kaupunki

        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi

        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")

f.close()
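A few observations on the code above, offered as likely causes rather than a confirmed diagnosis. The traceback is consistent with the line `soup = soup(page.content, 'html.parser')` rebinding the imported `soup` alias to a parsed document. After that, `soup(req.text, "html.parser")` no longer builds a new BeautifulSoup object; calling a parsed document is treated by bs4 as find_all(), which returns a ResultSet, hence the AttributeError. Giving the first parsed page a different name (for example `first_page_soup`) should avoid that. Separately, `my_url` already ends in `sivu=1`, so `my_url + str(sivu)` requests sivu=11, sivu=12 and so on, and `range(1, last_page)` stops one page short. The URL part can be sketched with the standard library only; `page_url` is a hypothetical helper, and `last_page = 3` is a placeholder for the value read from the pagination buttons.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its `sivu` query parameter set to `page`."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["sivu"] = str(page)  # overwrite instead of appending digits
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1"
last_page = 3  # placeholder; in the script this comes from the buttons

# + 1 because range() excludes its stop value.
urls = [page_url(base, sivu) for sivu in range(1, last_page + 1)]
for url in urls:
    print(url)
```

Each URL would then be passed to requests.get() inside the loop in place of `my_url + str(sivu)`.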
