I am trying to build a web crawler that pulls some information from Yahoo Finance as a personal project. However, there is one particular value on the Analysis page that I can't extract. The HTML seems complicated to me; could I get some guidance?

import scrapy

# tkrs is the list of ticker symbols to crawl; it is defined elsewhere (not shown here)

class yhcrawler(scrapy.Spider):
    name = 'yahoo'

    start_urls = [f'https://ca.finance.yahoo.com/quote/{t}/analysis?p={t}' for t in tkrs]

    def parse(self, response):
        filename = 'stock_growths.csv'

        # This is the selector that comes back empty
        l = response.css('div#YDC-Col1>div>div>div>div>div>section>table>tbody>tr>td#431::text').extract()
        print(l)

This is the selector I am trying:

l = response.css('div#YDC-Col1>div>div>div>div>div>section>table>tbody>tr>td#431::text').extract()

and I am getting an empty result:

2021-04-18 15:12:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ca.finance.yahoo.com/quote/M/analysis?p=M> (referer: None)
[]

The value I am trying to get is on the highlighted line of the screenshot: -11.82%

  • You need to specify the exact value you want from that site so that others can help you. Commented Apr 18, 2021 at 22:57
  • @SIM I added the value, -11.82%. Please advise. Commented Apr 19, 2021 at 2:46
  • I don't know which ticker you are using, so the value in the image is useless. Which value do you wish to grab, if you consider this link? Beware that the values there are not static, so specify it by field name, such as Current Year, Next Year, etc. Commented Apr 19, 2021 at 4:10
  • @sim The growth estimate for the next 5 years. Commented Apr 19, 2021 at 7:34

1 Answer

Try this:

import scrapy

class YahoofinanceSpider(scrapy.Spider):
    name = 'yahoofinance'
    start_urls = ['https://ca.finance.yahoo.com/quote/aapl/analysis?p=aapl']

    # Send a browser-like User-Agent with every request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, headers=self.headers)

    def parse(self, response):
        # Find the cell labelled 'Next 5 Years' and take the text of the cells that follow it in the same row
        item = response.xpath("//td[./span][contains(.,'Next 5 Years')]/following-sibling::td/text()").getall()
        yield {"item": item}
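The XPath above locates the value by its row label rather than by a long chain of nested elements. A rough offline sketch of that idea follows, using only the standard library instead of Scrapy's selectors. The `SAMPLE` markup is a hypothetical, simplified stand-in for the real table, the "Next Year" figure is a made-up placeholder, and `value_for_label` is an illustrative helper, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the growth-estimates table.
# The real page markup is more deeply nested; the values are placeholders.
SAMPLE = """
<table>
  <tbody>
    <tr><td><span>Next Year</span></td><td>5.30%</td></tr>
    <tr><td><span>Next 5 Years (per annum)</span></td><td>-11.82%</td></tr>
  </tbody>
</table>
""".strip()


def value_for_label(markup, label):
    """Return the text of the cells that follow the labelled cell in its row."""
    root = ET.fromstring(markup)
    for row in root.iter("tr"):
        cells = row.findall("td")
        for i, cell in enumerate(cells):
            # Mirrors td[./span][contains(., label)]: the cell has a <span>
            # child and its full text contains the label.
            has_span = cell.find("span") is not None
            if has_span and label in "".join(cell.itertext()):
                # Mirrors following-sibling::td/text()
                return [(c.text or "").strip() for c in cells[i + 1:]]
    return []


print(value_for_label(SAMPLE, "Next 5 Years"))  # ['-11.82%']
```

Matching on the label means the lookup should keep working if the page adds or removes wrapper elements, as long as the row text itself stays the same.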