
I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:

from selenium import webdriver
import codecs

driver = webdriver.Chrome()
driver.get("url")
results_table=driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')

Each element in results_table is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:

results_file=codecs.open(path+"results.txt","w","cp1252")

for i, element in enumerate(results_table):
    element_fields = element.find_elements_by_xpath(".//*[text()][count(*)=0]")
    element_list = [field.text for field in element_fields]
    stuff_to_write = '#'.join(element_list) + "\r\n"
    results_file.write(stuff_to_write)
    #print(i)
results_file.close()
driver.quit()

This second part of the code takes about 2.5 minutes for a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the performance?

Using python 3.6

  • Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, <browser_object>.html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page. Commented Dec 6, 2017 at 7:23
  • I am using Selenium because I need to scrape multiple pages on a website where login is needed, and I would like to avoid logging in once for each page. BeautifulSoup is an option, but I do not know how to make it grab the active chromedriver page. And still, learning-wise, I must be doing something structurally wrong in my code. Commented Dec 6, 2017 at 7:55
  • @horace_vr Does it speed up if you write to the file only once at the end, after the for loop instead of inside each iteration? Commented Dec 6, 2017 at 8:59
  • Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like driver.page_source may give the entire contents of the page in Selenium, which I found at stackoverflow.com/questions/35486374/…. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster. Commented Dec 6, 2017 at 13:31
  • @Gary02127 BeautifulSoup is the way to go; I tried it, based on your suggestion, and replaced the webdriver-based processing code, and instead of 2 minutes, the code is executed in a handful of seconds. If you elaborate and post an answer, I will accept it. It certainly answered my OP, although not a solution I had in mind when posting :) Commented Dec 6, 2017 at 21:37

1 Answer


Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used splinter or selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page.

Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like .page_source may give the entire contents of the page in Selenium, which I found at stackoverflow.com/questions/35486374/…. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
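A minimal sketch of this approach: grab the full page once with driver.page_source and do all the row/cell extraction offline with BeautifulSoup. The inline HTML string below is a hypothetical stand-in for the real page (in the actual script you would use `html = driver.page_source`), and the CSS selector is only a rough equivalent of the original XPath, not an exact translation.

```python
from bs4 import BeautifulSoup

# Stand-in for the real page; in the actual script use:
#   html = driver.page_source
html = """
<div id="content">
  <table><tbody>
    <tr><td>a</td><td>b</td></tr>
    <tr><td>c</td><td>d</td><td>e</td></tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Rough CSS equivalent of //*[@id="content"]/table[1]/tbody/tr
rows = soup.select("#content table tr")

# Build one '#'-delimited line per table row, as in the original loop
lines = ["#".join(cell.get_text(strip=True) for cell in row.find_all("td"))
         for row in rows]
print("\n".join(lines))
```

Because all the text extraction happens in memory on a single HTML snapshot, there is no per-element round trip to the browser, which is where the original loop spends most of its time.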
