
I need to scrape a page that uses JavaScript, which is why I'm using Selenium. The problem is that Selenium alone can't extract the required data.

I want to use HtmlXPathSelector to try to fetch the data.

How can I pass the HTML that Selenium produced to HtmlXPathSelector?

3 Answers


This is my solution: just create an HtmlXPathSelector from Selenium's page_source:

hxs = HtmlXPathSelector(text=sel.page_source)


Try creating a Response manually:

from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector

body = '''<html></html>'''

response = TextResponse(url='', body=body, encoding='utf-8')

hxs = HtmlXPathSelector(response)
hxs.select("/html")

2 Comments

How does Selenium come into play? I did selenium.get(url); how do I proceed?
I haven't used Selenium, but I guess you can get the page's HTML source from it. Once you have the page body, you create a response and can then use HtmlXPathSelector on it.

Manual response with Selenium:

from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
from scrapy.selector import HtmlXPathSelector
import time
from selenium import selenium

class DemoSpider(BaseSpider):
    name = "Demo"
    allowed_domains = ['www.example.com']
    start_urls = ["http://www.example.com/demo"]

    def __init__(self):
        BaseSpider.__init__(self)
        self.selenium = selenium("127.0.0.1", 4444, "*chrome", self.start_urls[0])
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()

    def parse(self, response):
        sel = self.selenium
        sel.open(response.url)
        time.sleep(2.0)  # wait for JavaScript execution

        # build the response object from Selenium's rendered page
        body = sel.get_html_source()
        sel_response = TextResponse(url=response.url, body=body, encoding='utf-8')
        hxs = HtmlXPathSelector(sel_response)
        hxs.select("//table").extract()

1 Comment

How do I make use of sel before the line body = sel.get_html_source()? I need to make an XPath query, then click() the returned elements one by one and download get_html_source() after each click. Any idea how to do that? sel does not seem to have methods for XPath queries on content.
