
I have multiple URLs to scrape stored in a csv file, one URL per row, and I'm using this code to request them:

def start_requests(self):
    with open('csvfile', 'rb') as f:
        list = []
        for line in f.readlines():
            array = line.split(',')
            url = array[9]
            list.append(url)
        list.pop(0)
    for url in list:
        if url != "":
            yield scrapy.Request(url=url, callback=self.parse)

It gives me the following error: IndexError: list index out of range. Can anyone help me correct this or suggest another way to use that csv file?

Edit: the csv file looks like this:

http://example.org/page1
http://example.org/page2

There are 9 such rows.

  • Would it be possible to share some of your csv file to help find what the issue is? IndexError: list index out of range most likely suggests that the cause is url = array[9]. Commented Jul 20, 2020 at 18:09
  • It is literally a csv file where each row is a URL: no extra signs, no separators, nothing, and there are 9 rows for test purposes. Commented Jul 20, 2020 at 18:12
  • Edited the question to show the csv file. Commented Jul 20, 2020 at 18:18

1 Answer


You should be able to do this by reading the csv file with the csv module, without most of the code above; there is no need to split, pop, or append. The IndexError itself comes from url = array[9]: each row of your file contains a single URL and no commas, so line.split(',') returns a one-element list and index 9 does not exist.
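For illustration, a minimal sketch of why the original loop fails on a file like yours (one URL per row, no commas):

line = "http://example.org/page1\n"
array = line.split(',')  # -> ['http://example.org/page1\n'], a one-element list
url = array[9]           # IndexError: list index out of range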

Working example

import csv
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('websites.csv') as csv_file:
            data = csv.reader(csv_file)
            for row in data:
                # csv.reader yields an empty list for a blank line; skip those rows
                if not row:
                    continue
                # Supposing that the data is in the first column
                url = row[0]
                if url != "":
                    # We need to check this has the http prefix or we get a Missing scheme error
                    if not url.startswith('http://') and not url.startswith('https://'):
                        url = 'https://' + url
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do my data extraction
        print("test")


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(QuotesSpider)
    c.start()
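If you save this as a single file (the file name is arbitrary, e.g. quotes_spider.py), you can run it directly with python quotes_spider.py; the CrawlerProcess block at the bottom replaces the usual scrapy crawl quotes command.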

4 Comments

It almost works perfectly. Since all the urls already start with https, it turned them into https:// http// example .com/site1 (without spaces), but after getting rid of part of the prefix check it works fine, thank you
You are correct, my prefix check should use and. I'll update it now. Anyway, glad it worked
If it's not inconvenient to you, would you mind explaining, or pointing me towards a source that explains, why the last part is necessary? The if __name__ part
That part is just used to run a spider as a single python script instead of via the scrapy crawl command. It is mentioned in the docs: docs.scrapy.org/en/latest/topics/…. The if __name__ == "__main__" part is general Python. You can find an explanation here: stackoverflow.com/questions/28336627/if-name-main-python
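For reference, a minimal sketch of the idiom on its own, outside of Scrapy (the main name is arbitrary):

def main():
    print("running as a script")

# __name__ equals "__main__" only when this file is executed directly;
# when the file is imported as a module, the call below does not run.
if __name__ == "__main__":
    main()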
