
I'm using Selenium for web scraping, but it's too slow, so I'm trying to use two instances to speed it up.

What I'm trying to accomplish is:

1) Create instance_1
2) Create instance_2
3) Open a page in the first instance (do nothing else yet)
4) Open a page in the second instance, then save the content of the first instance
5) Open a new page in the first instance, then save the content of the second instance

The idea is to use the time it takes the first page to load to open a second one.

links = ('https:my_page' + '&LIC=' + code.split('_')[1] for code in data)

browser = webdriver.Firefox()
browser_2 = webdriver.Firefox()


first_link = next(links)
browser.get(first_link)
time.sleep(0.5)

for i, link in enumerate(links):

    if i % 2:       # i starts at 0
        browser_2.get(link)
        time.sleep(0.5)
        try:
            content = browser.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content)

        except:
            print('error ' + str(i))

    else:
        browser.get(link)
        time.sleep(0.5)
        try:
            content_2 = browser_2.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content_2)

        except:
            print('error ' + str(i))

But the script waits for the first page to load completely before opening the next one, and this approach is also limited to only two pages at a time.
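The gain being aimed for here can be shown without Selenium at all. This is a minimal sketch where a sleeping function stands in for `browser.get()` (the URLs and timings are made up for illustration): run sequentially, the total time is the sum of both loads; run in two threads, the waits overlap and the total is roughly one load.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_page_load(url):
    # Stand-in for browser.get(): the sleep simulates network latency.
    time.sleep(0.2)
    return 'content of ' + url

urls = ['https://example.com/a', 'https://example.com/b']

# Sequential: total time is the sum of both loads (~0.4 s here).
start = time.monotonic()
sequential = [fake_page_load(u) for u in urls]
sequential_elapsed = time.monotonic() - start

# Threaded: the two waits overlap, so total time is roughly one load (~0.2 s).
start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as ex:
    threaded = list(ex.map(fake_page_load, urls))
threaded_elapsed = time.monotonic() - start

print(sequential_elapsed, threaded_elapsed)
```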

EDIT.

I made the following changes to the code from GIRISH RAMNANI's answer:

Create the browser instances outside the function

driver_1 = webdriver.Firefox()
driver_2 = webdriver.Firefox()
driver_3 = webdriver.Firefox()

drivers_instance = [driver_1,driver_2,driver_3]

Use the driver and the url as input for the function

def get_content(url, driver):
    driver.get(url)
    tag = driver.find_element_by_tag_name("a")
    # do your work here and return the result
    return tag.get_attribute("href")

Create link/browser pairs using the zip function:

from itertools import cycle

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    zip_list = zip(links, cycle(drivers_instance)) if len(links) > len(drivers_instance) else zip(cycle(links), drivers_instance)
    for pair in zip_list:
        futures.append(ex.submit(get_content, pair[0], pair[1]))
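A caveat with the `cycle()` pairing above is that two concurrently running tasks can end up holding the same driver. A safer variant of the same idea is a producer/consumer pool: workers borrow an instance from a queue and return it when done, so each instance is only ever used by one task at a time. Below is a runnable sketch of that pattern, where `DummyDriver` is a stand-in for `webdriver.Firefox()` so the example runs without Selenium:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class DummyDriver:
    """Stand-in for webdriver.Firefox() so the pattern runs without Selenium."""
    def __init__(self, name):
        self.name = name
    def get_source(self, url):
        return f'{url} fetched by {self.name}'

# The pool holds the fixed set of instances; tasks borrow and return them,
# so no two running tasks ever share a driver.
pool = queue.Queue()
for n in ('driver_1', 'driver_2'):
    pool.put(DummyDriver(n))

def get_content(url):
    driver = pool.get()            # blocks until an instance is free
    try:
        return driver.get_source(url)   # with Selenium: driver.get(url); ...
    finally:
        pool.put(driver)           # hand the instance back for the next task

urls = [f'https://example.com/{i}' for i in range(5)]
with ThreadPoolExecutor(max_workers=2) as ex:
    results = list(ex.map(get_content, urls))
print(results)
```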
  • You may achieve better results threading with a consumer/producer queue and worker functions. Commented Mar 17, 2016 at 3:16
  • you can use the multiprocessing module to create a separate Process for each browser Commented Mar 17, 2016 at 3:19
  • Have you tried not using Selenium at all? I mean, it is slow by nature because it is emulating a full-fledged browser. If the page you are trying to scrape isn't full of AJAX, a simpler approach using just plain requests/lxml (or BS4) or perhaps mechanize (if you need forms) should be a lot faster by default. You can also use the aforementioned tools with scrapy, in case you need to scrape a LOT of pages. Commented Mar 17, 2016 at 4:52
  • @GustavoBezerra Yes, I usually use scrapy, but in this case I need to interact with the page in order to get the data. Commented Mar 17, 2016 at 14:39

1 Answer


You can use concurrent.futures here.

from selenium import webdriver
from concurrent.futures import ThreadPoolExecutor

URL = "https://pypi.python.org/pypi/{}"

li = ["pywp/1.3", "augploy/0.3.5"]

def get_content(url):
    driver = webdriver.Firefox()
    driver.get(url)
    tag = driver.find_element_by_tag_name("a")
    # do your work here and return the result
    return tag.get_attribute("href")


li = list(map(lambda link: URL.format(link), li))


futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    for link in li:
        futures.append(ex.submit(get_content, link))

for future in futures:
    print(future.result())

Keep in mind that two instances of Firefox will start.

Note: you might want to use a headless browser such as PhantomJS instead of Firefox.


4 Comments

It almost works, but I have an issue: each time the function get_content is called it creates a new instance of the browser, so at each iteration I get two new windows. I guess I could close the browser, but I would still have the problem of the time it takes to start. Is it possible to use the same two instances all the time?
I made a few changes and now it works fine, but I had to change the generators for lists. Do you have any suggestion for using the generators again, or any other improvement?
For the issue of idle instances, you can create a list of Firefox instances with a contextmanager which on enter removes an instance from the idle list and adds it back on exit. More info on contextmanager
@LuisRamonRamirezRodriguez Hi, did you manage to implement the constant reuse of the same two browser instances? If so, how? I would be grateful for an answer.
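The contextmanager approach suggested in the comments can be sketched as follows. `DummyDriver` is a stand-in for `webdriver.Firefox()` so this runs without Selenium; with real drivers you would create the pool once at startup and `quit()` each instance at shutdown:

```python
import queue
from contextlib import contextmanager

class DummyDriver:
    """Stand-in for webdriver.Firefox() so the sketch runs without Selenium."""
    def __init__(self, name):
        self.name = name

# Fixed pool of instances, created once up front.
idle = queue.Queue()
for name in ('driver_1', 'driver_2'):
    idle.put(DummyDriver(name))

@contextmanager
def borrow_driver():
    driver = idle.get()       # on enter: take an instance off the idle list
    try:
        yield driver
    finally:
        idle.put(driver)      # on exit: put it back for reuse

with borrow_driver() as driver:
    used = driver.name        # with Selenium: driver.get(url), etc.

print(used, idle.qsize())
```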
