I am implementing a program to scrape job offers from a site. However, I have a problem: on this site, links are sometimes written with relative hrefs and sometimes with absolute ones (e.g. sometimes I get https://fr.indeed.com, which is correct, but other times /cmd/jobs, which is not). I therefore need to check, before sending each request, that its URL is valid. To do this, I wrote a middleware that performs this check:
class IndeedDomainVerificationMiddleware:
    def __init__(self, indeed_domain_url):
        self.indeed_domain_url = indeed_domain_url

    @classmethod
    def from_crawler(cls, crawler):
        return cls(indeed_domain_url=crawler.settings.get('INDEED_DOMAIN_URL'))

    def process_request(self, request, spider):
        if spider.name == 'Indeed':
            print(request.url)
            if self.indeed_domain_url not in request.url:
                new_url = self.indeed_domain_url + request.url
                print("New URL: %s" % new_url)
                return request.replace(url=new_url)
        return None
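For reference, this is the normalization I am trying to achieve. A minimal sketch using the standard library's urljoin (the BASE value mirrors my INDEED_DOMAIN_URL setting; the normalize helper is just for illustration, not part of my middleware):

```python
from urllib.parse import urljoin

BASE = "https://fr.indeed.com"  # same value as INDEED_DOMAIN_URL in settings

def normalize(url):
    # urljoin resolves a relative path against the base,
    # and leaves an already-absolute URL untouched
    return urljoin(BASE, url)

print(normalize("/cmd/jobs"))                   # https://fr.indeed.com/cmd/jobs
print(normalize("https://fr.indeed.com/jobs"))  # https://fr.indeed.com/jobs
```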
This class is defined in middlewares.py.
Then, in my settings, I activate the middleware:
SCRAPEOPS_API_KEY = 'API KEY'
SCRAPEOPS_PROXY_ENABLED = True
SCRAPEOPS_PROXY_SETTINGS = {'country': 'fr'}
INDEED_DOMAIN_URL = "https://fr.indeed.com"
DOWNLOADER_MIDDLEWARES = {
    "Cy_Scraper.middlewares.IndeedDomainVerificationMiddleware": 543,
    "Cy_Scraper.middlewares.ScrapeOpsProxyMiddleware": 725,
}
I also use a second middleware (ScrapeOps) that routes requests through a proxy to avoid being detected. However, when I launch my spider, I run into an infinite loop.
I tried various things, such as implementing the check as a spider middleware as shown there, but that didn't work either: I kept getting an infinite loop when I returned the request, and duplicate requests when I returned None. I also saw the post from Rahul, but it didn't help.