I am implementing a program to scrape job offers from a site. However, I have a problem: on this site, links are sometimes written with relative hrefs and sometimes with absolute ones (e.g. sometimes I get https://fr.indeed.com, which is correct, but other times /cmd/jobs, which is not). I therefore need to check, before sending each request, that its URL is valid. To do this, I wrote a middleware that performs this check:
class IndeedDomainVerificationMiddleware:
    def __init__(self, indeed_domain_url):
        self.indeed_domain_url = indeed_domain_url

    @classmethod
    def from_crawler(cls, crawler):
        return cls(indeed_domain_url=crawler.settings.get('INDEED_DOMAIN_URL'))

    def process_request(self, request, spider):
        if spider.name == 'Indeed':
            print(request.url)
            if self.indeed_domain_url not in request.url:
                new_url = self.indeed_domain_url + request.url
                print("New URL: %s" % new_url)
                return request.replace(url=new_url)
        return None
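For reference, this is the normalization I am trying to achieve. A minimal sketch using the standard library's urljoin (the BASE value mirrors my INDEED_DOMAIN_URL setting; the normalize helper is just for illustration, not part of my middleware):

```python
from urllib.parse import urljoin

BASE = "https://fr.indeed.com"  # same value as INDEED_DOMAIN_URL in settings

def normalize(url):
    # urljoin resolves a relative path against the base,
    # and leaves an already-absolute URL untouched
    return urljoin(BASE, url)

print(normalize("/cmd/jobs"))                   # https://fr.indeed.com/cmd/jobs
print(normalize("https://fr.indeed.com/jobs"))  # https://fr.indeed.com/jobs
```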
This class is defined in middlewares.py.
Then, in my settings, I activate the middleware:
SCRAPEOPS_API_KEY = 'API KEY'
SCRAPEOPS_PROXY_ENABLED = True
SCRAPEOPS_PROXY_SETTINGS = {'country': 'fr'}
INDEED_DOMAIN_URL = "https://fr.indeed.com"
DOWNLOADER_MIDDLEWARES = {
    "Cy_Scraper.middlewares.IndeedDomainVerificationMiddleware": 543,
    "Cy_Scraper.middlewares.ScrapeOpsProxyMiddleware": 725,
}
I also use a second middleware (ScrapeOps) that routes requests through a proxy to avoid being detected. However, when I launch my spider, I run into an infinite loop.
I tried various things, such as implementing the check as a spider middleware as shown there, but that didn't work either: I kept getting an infinite loop when I returned the request, and duplicate requests when I returned None. I also saw the post from Rahul, but it didn't help.