I am saving the output of a Scrapy web crawl to a CSV file. The crawl itself seems to work correctly, but I am not happy with the format of the saved output. I crawl 20 web pages, each containing 100 job titles and their respective URLs, so I expect the output to look like this:

url1, title1
url2, title2
...
...
url1999, title1999
url2000, title2000

However, the actual output in the CSV file looks like this:

url1 url2 ... url100, title1 title2 ... title100
url101 url102 ... url200, title101 title102 ... title200
...
url1901 url1902 ... url2000, title1901 title1902 ... title2000

My Spider code is:

import scrapy

class TextPostItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "craig_spider"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        number = 0
        for page in range(0, 20):
            yield scrapy.Request("http://sfbay.craigslist.org/search/npo?=%s" % number, callback=self.parse_item, dont_filter=True)
            number += 100

    def parse_item(self, response):
        item = TextPostItem()
        item['title'] = response.xpath("//span[@class='pl']/a/text()").extract()
        item['link'] = response.xpath("//span[@class='pl']/a/@href").extract()
        return item

The command I use to export to CSV is:

scrapy crawl craig_spider -o craig.csv -t csv

Any suggestion? Thanks.

  • What is your csv code? Commented Sep 23, 2015 at 19:57
  • My csv code is: scrapy crawl craig_spider -o craig.csv -t csv Commented Sep 23, 2015 at 19:59

1 Answer

The problem is that each response contains many //span[@class='pl']/a elements. Your code loads every text() into a single list and assigns it to item['title'], then loads every @href into another list and assigns it to item['link'], so each page produces one item instead of 100.

In other words, for the first response, you are essentially doing the following:

item['title'] = [title1, title2, ..., title100]
item['link'] = [url1, url2, ..., url100]

So, that's being sent to CSV as:

title,link
[title1, title2, ..., title100],[url1, url2, ..., url100]
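To see that collapse in isolation, here is a minimal sketch that uses only Python's csv module instead of Scrapy. The sample lists and the function name one_row_per_page are made up for illustration, and joining the values with commas is an assumption about how a list ends up in one cell; the separator in your actual file may differ.

```python
import csv
import io

# Hypothetical stand-ins for the lists that .extract() returns for one page.
titles = ["title1", "title2", "title3"]
links = ["url1", "url2", "url3"]


def one_row_per_page(titles, links):
    """Write one header row and ONE data row holding every value of the page."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["title", "link"])
    # Each list is squeezed into a single cell (comma join is an assumption).
    writer.writerow([",".join(titles), ",".join(links)])
    return buf.getvalue()


page_csv = one_row_per_page(titles, links)
print(page_csv)
```

However many jobs the page lists, the file only grows by one data row per page, which matches the 20-row output described in the question.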

To fix this, loop over each //span[@class='pl']/a element and yield a separate item for each one:

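def parse_item(self, response):
    # Iterate over each matching <a> node so every job gets its own item.
    for span in response.xpath("//span[@class='pl']/a"):
        item = TextPostItem()
        # Relative XPaths (leading ".") only look inside the current node.
        item['title'] = span.xpath(".//text()").extract()
        item['link'] = span.xpath(".//@href").extract()
        # Yield one item per job; each item becomes one CSV row.
        yield item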
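
The effect on the file can be sketched without Scrapy as well. This is only an illustration under assumptions: one_row_per_job and the sample values are hypothetical, and zip() stands in for the per-node loop above.

```python
import csv
import io


def one_row_per_job(titles, links):
    """Write one CSV row per (link, title) pair instead of one row per page."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["link", "title"])
    # zip() pairs the i-th link with the i-th title, one pair per row.
    for link, title in zip(links, titles):
        writer.writerow([link, title])
    return buf.getvalue()


# Hypothetical sample data standing in for one scraped page.
job_csv = one_row_per_job(["title1", "title2"], ["url1", "url2"])
print(job_csv)
```

Note that zip() silently assumes both lists line up and have the same length. Looping over the nodes themselves, as in parse_item above, avoids that assumption because the title and link are read from the same element.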

1 Comment

Thank you for the suggestion! Appreciated!
