
I'm writing a web scraping bot for AutoTrader, a popular car trading site in the UK. I'm trying to do as much as I can on my own, but I'm stuck on how to get my script to do what I want.

Basically, I want the bot to download certain information from the first 100 pages of listings for every car make and model within a particular radius of my home. I also want the bot to stop trying to download further pages for a particular make/model once there are no more new listings.

For instance, if there are only 4 pages of listings and I ask it to download the listings on page 5, the URL automatically changes back to page 1, and the bot downloads all the listings on page 1 again; it then repeats this for every page up to 100. Obviously I don't want 96 repeats of the cars on page 1 in my data set, so I'd like to move on to the next model of car when this happens, but I haven't figured out a way to do that yet.

Here's what I have got so far:

for x in range(1, 101):
    makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON%20MARTIN", "AUDI"]
    for make in makes:
        my_url_page_x_make_i = 'https://www.autotrader.co.uk/car-search?' + 'sort=distance' + '&postcode=BS247EY' + '&radius=300' + '&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New' + '&make=' + make + '&page=' + str(x)
        uClient = uReq(my_url_page_x_make_i)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        listings = page_soup.findAll("li", {"class": "search-page__result"})
        for listing in listings:
            information_container = listing.find("div", {"class": "information-container"})
            title_container = information_container.find("a", {
                "class": "js-click-handler listing-fpa-link tracking-standard-link"})
            title = title_container.text
            price = listing.find("div", {"class": "vehicle-price"}).text

            print("title: " + title)
            print("price: " + price)

            f.write(title.replace(",", "") + "," + price.replace(",", "") + "\n")
            if len(listings) < 13: makes.remove(make)

f.close()

This is far from a finished script and I only have about 1 week of real Python coding experience.

  • idownvotedbecau.se/toomuchcode Leave only what is relevant to page transitions since that's what you are asking about. Commented Feb 1, 2020 at 0:54
  • 1
    I think you can switch for i ... and for x ... in your code. To skip rest 96 pages, use break in for x... loop after the condition you have no pages found. Commented Feb 1, 2020 at 1:27
  • How would I write a condition that no pages were found? If there are 14 pages, for instance, and I ask for page 15, the site just loads page 1 rather than an empty "page not found" page. Commented Feb 1, 2020 at 13:07
  • I edited the script a bit and cleaned it up to make it more readable Commented Feb 1, 2020 at 15:36
  • @pvmlad are you using the standard library urllib? Commented Feb 1, 2020 at 16:29

1 Answer


I think I've solved your problem, but I'd suggest you invert your loops: loop over makes before you loop over pages. Keeping your original implementation, I solved the problem by scraping the page numbers from the bottom of the page; that way you can stop whenever you run out of pages. I also changed BeautifulSoup.findAll to BeautifulSoup.find_all because, assuming you're using BeautifulSoup 4, the camelCase method is deprecated.

# please show your imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
# I assume you imported BeautifulSoup as soup and urlopen as uReq


# I assume you opened a file object
with open('output.txt', 'w') as f:
    # for the aston martin, if you want this to be scalable, escape url invalid
    # chars using urllib.parse.quote()
    makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON%20MARTIN", "AUDI"]
    # make it clear what variables are
    for page in range(1, 101):  # while testing I used 9 pages for speed's sake
        for make in makes.copy():  # iterate over a copy so removing a make below is safe
            # don't overcomplicate variable names; here I believe that an f-string would be appropriate
            req_url = f"https://www.autotrader.co.uk/car-search?sort=distance&" \
                      f"postcode=BS247EY&radius=300&onesearchad=Used&onesearchad=Nearly%20New&" \
                      f"onesearchad=New&make={make}&page={page}"
            req = urlopen(req_url)
            page_html = req.read()
            req.close()
            page_soup = BeautifulSoup(page_html, "html.parser")
            # BeautifulSoup.findAll is deprecated use find_all instead
            listings = page_soup.find_all("li", {"class": "search-page__result"})
            for listing in listings:
                information_container = listing.find("div", {"class": "information-container"})
                title_container = information_container.find("a", {
                    "class": "js-click-handler listing-fpa-link tracking-standard-link"})
                title = title_container.text
                price = listing.find("div", {"class": "vehicle-price"}).text
                print("make:", make)
                print("title:", title)
                print("price:", price)
                f.write(title.replace(",", "") + "," + price.replace(",", "") + "\n")
            # Solving your issue
            # we take the page numbers from the bottom of the page and take the last
            # actually here it's the last but one (-2) because the last element would
            # be the arrow.
            pagination = page_soup.find_all('li', {'class': 'pagination--li'})[-2]
            # convert it to int and compare it to the current page
            # if it's less than or equal to the current page, remove
            # the make from the list.
            if int(pagination.text) <= page:
                makes.remove(make)
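As an aside on the urllib.parse.quote() suggestion in the code comments above, here's a minimal sketch of keeping the make names human-readable and escaping them only when building the URL, instead of hard-coding %20 into the list (the make list is the one from the question):

```python
from urllib.parse import quote

# Escape URL-invalid characters (like the space in "ASTON MARTIN")
# on the fly rather than storing pre-escaped strings.
makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON MARTIN", "AUDI"]
escaped = [quote(make) for make in makes]
print(escaped[4])  # ASTON%20MARTIN
```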

7 Comments

Thanks for taking a look at the script and helping me out with it. Your comments have been really useful, and I took your advice about swapping the loops so that it loops over make first and then page. The solution you suggested of taking the page numbers from the bottom of the page is going to be a bit more difficult than you said, because if a particular car has more than one page of listings, the site lists the next pages you can click at the bottom, next to the current page you're on. Here is a link example:
autotrader.co.uk/… So this is on page 5, but at the bottom we can click to go straight to page 8 if we desire; we can also click to go back all the way to page 1.
I was thinking another way to do this would be to compare all of the page links at the bottom and take the current page as the one that does not have a hyperlink attached to it. For instance, using the link above, we know the current page is 5 because you can't click the number 5 to load page 5, since we're already on page 5.
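That idea can be sketched with BeautifulSoup: pick the pagination entry that contains no <a> tag. The markup below is made up for illustration; AutoTrader's real class names and structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup: the current page's number is plain
# text, while every other page number is wrapped in a link.
sample = """
<ul>
  <li class="pagination--li"><a href="?page=1">1</a></li>
  <li class="pagination--li"><a href="?page=4">4</a></li>
  <li class="pagination--li">5</li>
  <li class="pagination--li"><a href="?page=6">6</a></li>
</ul>
"""
page_soup = BeautifulSoup(sample, "html.parser")
# The current page is the pagination item with no <a> inside it.
current = next(li for li in page_soup.find_all("li", {"class": "pagination--li"})
               if li.find("a") is None)
print(int(current.text))  # 5
```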
On another note, I also wanted to add another loop for the models of the cars, but the problem is I'm not sure how to get the loop to only iterate over the models relevant to each make. For instance, there would be no need to search Abarth as the make and DBS as the model, since Abarth doesn't make a DBS. I could simply have the script run anyway, but it would add a lot of time, since the majority of models belong to manufacturers other than the one currently being searched.
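One way to keep models tied to their make, assuming you can build (or scrape) a per-manufacturer list, is a dict keyed by make. The model names below are only illustrative examples, not a real mapping:

```python
# Hypothetical make -> models mapping; the inner loop then only ever
# sees models that actually belong to the current make.
models_by_make = {
    "ABARTH": ["595", "124%20SPIDER"],
    "ASTON%20MARTIN": ["DB11", "DBS", "VANTAGE"],
}

for make, models in models_by_make.items():
    for model in models:
        url = (f"https://www.autotrader.co.uk/car-search?postcode=BS247EY"
               f"&make={make}&model={model}&page=1")
        print(url)
```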
Here's a pastebin of what I have so far: pastebin.com/AnbsXHNv this is 100% of my script, there are no imports or anything that I have left out
