
I need to scrape a page that uses JavaScript, which is why I'm using Selenium. The problem is that Selenium alone can't extract the required data.

I want to use HtmlXPathSelector to try to fetch the data.

How can I pass the HTML that Selenium produced to HtmlXPathSelector?

3 Answers


This is my solution: just create an HtmlXPathSelector from Selenium's page_source:

hxs = HtmlXPathSelector(text=sel.page_source)


Try creating a Response manually:

from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector

body = '''<html></html>'''

response = TextResponse(url='', body=body, encoding='utf-8')

hxs = HtmlXPathSelector(response)
hxs.select("/html")

2 Comments

How does Selenium come into play? I did selenium.get(url); how do I proceed?
I haven't used Selenium, but I guess you can get the page's HTML source from it. Once you have the page body, you create a response and can then use HtmlXPathSelector on it.

Manual response with Selenium:

from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector
import time
from selenium import selenium

class DemoSpider(BaseSpider):
    name = "Demo"
    allowed_domains = ['www.example.com']
    start_urls = ["http://www.example.com/demo"]

    def __init__(self):
        BaseSpider.__init__(self)
        self.selenium = selenium("127.0.0.1", 4444, "*chrome", self.start_urls[0])
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()

    def parse(self, response):
        sel = self.selenium
        sel.open(response.url)
        time.sleep(2.0)  # wait for JavaScript execution

        # build the response object from Selenium's rendered page
        body = sel.get_html_source()
        sel_response = TextResponse(url=response.url, body=body, encoding='utf-8')
        hxs = HtmlXPathSelector(sel_response)
        hxs.select("//table").extract()

1 Comment

How do I make use of sel before the line body = sel.get_html_source()? I need to make an XPath query, then click() the returned elements one by one and download get_html_source() after each click. Any idea how to do that? sel does not seem to have methods for XPath queries on content.
