
I'm using Selenium for web scraping, but it's too slow, so I'm trying to use two instances to speed it up.

What I'm trying to accomplish is:

1) Create instance_1
2) Create instance_2
3) Open a page in the first instance (do nothing else yet)
4) Open a page in the second instance, then save the content of the first instance
5) Open a new page in the first instance, then save the content of the second instance

The idea is to use the time it takes the first page to load to open a second one.

links = ('https:my_page' + '&LIC=' + code.split('_')[1] for code in data)

browser = webdriver.Firefox()
browser_2 = webdriver.Firefox()


first_link = next(links)
browser.get(first_link)
time.sleep(0.5)

for i, link in enumerate(links):

    if i % 2:       # i starts at 0
        browser_2.get(link)
        time.sleep(0.5)
        try:
            content = browser.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content)

        except:
            print('error ' + str(i))

    else:
        browser.get(link)
        time.sleep(0.5)
        try:
            content_2 = browser_2.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content_2)

        except:
            print('error ' + str(i))

But the script waits for the first page to load completely before opening the next one, and this approach is also limited to only two pages at a time.
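The gain being aimed for here can be shown without Selenium at all. This is a minimal sketch where a sleeping function stands in for `browser.get()` (the URLs and timings are made up for illustration): run sequentially, the total time is the sum of both loads; run in two threads, the waits overlap and the total is roughly one load.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_page_load(url):
    # Stand-in for browser.get(): the sleep simulates network latency.
    time.sleep(0.2)
    return 'content of ' + url

urls = ['https://example.com/a', 'https://example.com/b']

# Sequential: total time is the sum of both loads (~0.4 s here).
start = time.monotonic()
sequential = [fake_page_load(u) for u in urls]
sequential_elapsed = time.monotonic() - start

# Threaded: the two waits overlap, so total time is roughly one load (~0.2 s).
start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as ex:
    threaded = list(ex.map(fake_page_load, urls))
threaded_elapsed = time.monotonic() - start

print(sequential_elapsed, threaded_elapsed)
```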

EDIT.

I made the following changes to the code from GIRISH RAMNANI's answer:

Create the browser instances outside the function

driver_1 = webdriver.Firefox()
driver_2 = webdriver.Firefox()
driver_3 = webdriver.Firefox()

drivers_instance = [driver_1,driver_2,driver_3]

Use the driver and the url as input for the function

def get_content(url, driver):
    driver.get(url)
    tag = driver.find_element_by_tag_name("a")
    # do your work here and return the result
    return tag.get_attribute("href")

Create link/browser pairs using the zip function:

from itertools import cycle

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    zip_list = zip(links, cycle(drivers_instance)) if len(links) > len(drivers_instance) else zip(cycle(links), drivers_instance)
    for pair in zip_list:
        futures.append(ex.submit(get_content, pair[0], pair[1]))
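A caveat with the `cycle()` pairing above is that two concurrently running tasks can end up holding the same driver. A safer variant of the same idea is a producer/consumer pool: workers borrow an instance from a queue and return it when done, so each instance is only ever used by one task at a time. Below is a runnable sketch of that pattern, where `DummyDriver` is a stand-in for `webdriver.Firefox()` so the example runs without Selenium:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class DummyDriver:
    """Stand-in for webdriver.Firefox() so the pattern runs without Selenium."""
    def __init__(self, name):
        self.name = name
    def get_source(self, url):
        return f'{url} fetched by {self.name}'

# The pool holds the fixed set of instances; tasks borrow and return them,
# so no two running tasks ever share a driver.
pool = queue.Queue()
for n in ('driver_1', 'driver_2'):
    pool.put(DummyDriver(n))

def get_content(url):
    driver = pool.get()            # blocks until an instance is free
    try:
        return driver.get_source(url)   # with Selenium: driver.get(url); ...
    finally:
        pool.put(driver)           # hand the instance back for the next task

urls = [f'https://example.com/{i}' for i in range(5)]
with ThreadPoolExecutor(max_workers=2) as ex:
    results = list(ex.map(get_content, urls))
print(results)
```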
  • You may achieve better results threading with a consumer/producer queue and worker functions. Commented Mar 17, 2016 at 3:16
  • you can use the multiprocessing module to create a separate Process for each browser Commented Mar 17, 2016 at 3:19
  • Have you tried not using Selenium at all? I mean, it is slow by nature because it is emulating a full-fledged browser. If the page you are trying to scrape isn't full of AJAX, a simpler approach using just plain requests/lxml (or BS4) or perhaps mechanize (if you need forms) should be a lot faster by default. You can also use the aforementioned tools with scrapy, in case you need to scrape a LOT of pages. Commented Mar 17, 2016 at 4:52
  • @GustavoBezerra Yes, I usually use scrapy, but in this case I need to interact with the page in order to get the data. Commented Mar 17, 2016 at 14:39

1 Answer


You can use concurrent.futures here.

from selenium import webdriver
from concurrent.futures import ThreadPoolExecutor

URL = "https://pypi.python.org/pypi/{}"

li = ["pywp/1.3", "augploy/0.3.5"]

def get_content(url):
    driver = webdriver.Firefox()
    driver.get(url)
    tag = driver.find_element_by_tag_name("a")
    # do your work here and return the result
    return tag.get_attribute("href")


li = list(map(lambda link: URL.format(link), li))


futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    for link in li:
        futures.append(ex.submit(get_content, link))

for future in futures:
    print(future.result())

Keep in mind that two instances of Firefox will start.

Note: you might want to use a headless browser such as PhantomJS instead of Firefox.


4 Comments

It almost works, but I have an issue: each time the function get_content is called it creates a new instance of the browser, so at each iteration I get two new windows. I guess I could close the browser, but I would still have the problem of the time it takes to start. Is it possible to use the same two instances all the time?
I made a few changes and now it works fine, but I had to change the generators for lists. Do you have any suggestion for using the generators again, or any other improvement?
For the issue of idle instances, you can create a list of Firefox instances with a contextmanager which on enter removes an instance from the idle list and adds it back on exit. More info on contextmanager
@LuisRamonRamirezRodriguez Hi, did you manage to implement the constant reuse of the same two browser instances? If so, how? I would be grateful for an answer.
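The contextmanager approach suggested in the comments can be sketched as follows. `DummyDriver` is a stand-in for `webdriver.Firefox()` so this runs without Selenium; with real drivers you would create the pool once at startup and `quit()` each instance at shutdown:

```python
import queue
from contextlib import contextmanager

class DummyDriver:
    """Stand-in for webdriver.Firefox() so the sketch runs without Selenium."""
    def __init__(self, name):
        self.name = name

# Fixed pool of instances, created once up front.
idle = queue.Queue()
for name in ('driver_1', 'driver_2'):
    idle.put(DummyDriver(name))

@contextmanager
def borrow_driver():
    driver = idle.get()       # on enter: take an instance off the idle list
    try:
        yield driver
    finally:
        idle.put(driver)      # on exit: put it back for reuse

with borrow_driver() as driver:
    used = driver.name        # with Selenium: driver.get(url), etc.

print(used, idle.qsize())
```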
