
I have multiple URLs to scrape stored in a csv file, one URL per row, and I'm using this code to request them:

def start_requests(self):
    with open('csvfile', 'rb') as f:
        list = []
        for line in f.readlines():
            array = line.split(',')
            url = array[9]
            list.append(url)
        list.pop(0)
    for url in list:
        if url != "":
            yield scrapy.Request(url=url, callback=self.parse)

It gives me the following error: IndexError: list index out of range. Can anyone help me correct this or suggest another way to use that csv file?

Edit: the csv file looks like this:

http://example.org/page1
http://example.org/page2

There are 9 such rows.

  • Would it be possible to share some of your csv file to help find what the issue is? IndexError: list index out of range most likely suggests that the cause is url = array[9]. Commented Jul 20, 2020 at 18:09
  • It is literally a csv file where each row is a URL: no extra signs, no separators, nothing, and there are 9 rows for test purposes. Commented Jul 20, 2020 at 18:12
  • Edited the question to show the csv file. Commented Jul 20, 2020 at 18:18

1 Answer


You should be able to do this by reading the csv file with the csv module, without most of the code above; there is no need to split, pop, or append. The IndexError itself comes from url = array[9]: each row of your file contains a single URL and no commas, so line.split(',') returns a one-element list and index 9 does not exist.
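For illustration, a minimal sketch of why the original loop fails on a file like yours (one URL per row, no commas):

line = "http://example.org/page1\n"
array = line.split(',')  # -> ['http://example.org/page1\n'], a one-element list
url = array[9]           # IndexError: list index out of range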

Working example

import csv
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('websites.csv') as csv_file:
            data = csv.reader(csv_file)
            for row in data:
                # csv.reader yields an empty list for a blank line; skip those rows
                if not row:
                    continue
                # Supposing that the data is in the first column
                url = row[0]
                if url != "":
                    # We need to check this has the http prefix or we get a Missing scheme error
                    if not url.startswith('http://') and not url.startswith('https://'):
                        url = 'https://' + url
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do my data extraction
        print("test")


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(QuotesSpider)
    c.start()
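If you save this as a single file (the file name is arbitrary, e.g. quotes_spider.py), you can run it directly with python quotes_spider.py; the CrawlerProcess block at the bottom replaces the usual scrapy crawl quotes command.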

4 Comments

It almost works perfectly. Since all the urls already start with https, it turned them into https:// http// example .com/site1 (without spaces), but after getting rid of part of the prefix check it works fine, thank you
You are correct, my prefix check should use and. I'll update it now. Anyway, glad it worked
If it's not inconvenient to you, would you mind explaining, or pointing me towards a source that explains, why the last part is necessary? The if __name__ part
That part is just used to run a spider as a single python script instead of via the scrapy crawl command. It is mentioned in the docs: docs.scrapy.org/en/latest/topics/…. The if __name__ == "__main__" part is general Python. You can find an explanation here: stackoverflow.com/questions/28336627/if-name-main-python
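For reference, a minimal sketch of the idiom on its own, outside of Scrapy (the main name is arbitrary):

def main():
    print("running as a script")

# __name__ equals "__main__" only when this file is executed directly;
# when the file is imported as a module, the call below does not run.
if __name__ == "__main__":
    main()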
