I'm using Selenium for web scraping, but it's too slow, so I'm trying to use two instances to speed it up.
What I'm trying to accomplish is:
1) create instance_1
2) create instance_2
3) Open a page in the first instance
do nothing
4) Open a page in the second instance
save the content of the first instance
5) Open a new page in the first instance
save the content of the second instance
The idea is to use the time it takes the first page to load to open a second one.
import re
import time

from selenium import webdriver

links = ('https:my_page' + '&LIC=' + code.split('_')[1] for code in data)

browser = webdriver.Firefox()
browser_2 = webdriver.Firefox()

# Load the first page so one request is already in flight when the loop starts
first_link = next(links)
browser.get(first_link)
time.sleep(0.5)

prev_link = first_link
for i, link in enumerate(links):
    if i % 2:  # i starts at 0
        # Start loading the new page in the second browser...
        browser_2.get(link)
        time.sleep(0.5)
        try:
            # ...and save the page the first browser loaded on the previous pass;
            # the file is named after the link whose content is actually saved
            content = browser.page_source
            name = re.findall('&LIC=(.+)&SAW', prev_link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content)
        except Exception:
            print 'error ' + str(i)
    else:
        browser.get(link)
        time.sleep(0.5)
        try:
            content_2 = browser_2.page_source
            name = re.findall('&LIC=(.+)&SAW', prev_link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content_2)
        except Exception:
            print 'error ' + str(i)
    prev_link = link
But the script waits for the first page to finish loading completely before opening the next one, and this approach is also limited to only two pages at a time.
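One workaround for the blocking driver.get is to tell the driver not to wait for the full page load. This is just a sketch, assuming your Selenium/geckodriver combination honors the pageLoadStrategy capability ('eager' returns once the DOM is ready, 'none' almost immediately):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.FIREFOX.copy()
caps['pageLoadStrategy'] = 'eager'  # or 'none'; assumption: your driver supports this capability

browser = webdriver.Firefox(capabilities=caps)
browser.get('https:my_page')  # returns without waiting for every resource

With 'none' you will have to poll or wait explicitly before reading page_source, since the call returns before the page has any content.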
EDIT.
I made the following changes to GIRISH RAMNANI's code:
Create the browser instances outside the function:
driver_1 = webdriver.Firefox()
driver_2 = webdriver.Firefox()
driver_3 = webdriver.Firefox()
drivers_instance = [driver_1, driver_2, driver_3]
Use the driver and the URL as input for the function:
def get_content(url, driver):
    driver.get(url)
    tag = driver.find_element_by_tag_name("a")
    # do your work here and return the result
    return tag.get_attribute("href")
Create link/browser pairs using the zip function:
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    # cycle the shorter sequence so zip does not cut the longer one short
    zip_list = zip(links, cycle(drivers_instance)) if len(links) > len(drivers_instance) else zip(cycle(links), drivers_instance)
    for par in zip_list:
        futures.append(ex.submit(get_content, par[0], par[1]))
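To actually read the scraped values back, you can iterate over the futures as they finish; a minimal sketch using as_completed from the same concurrent.futures module:

from concurrent.futures import as_completed

for future in as_completed(futures):
    try:
        print future.result()  # the href returned by get_content
    except Exception as e:
        print 'error: ' + str(e)

One caveat: if there are more links than drivers, cycling the drivers can hand the same instance to two threads at once, and a Selenium driver is not safe to share between threads, so keep max_workers no larger than the number of drivers.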
Alternatively, you could use the multiprocessing module to create a separate Process for each browser. requests/lxml (or BS4), or perhaps mechanize (if you need forms), should be a lot faster by default. You can also use the aforementioned tools with scrapy, in case you need to scrape a LOT of pages.
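For illustration, a rough sketch of that requests-based alternative, fetching pages in parallel with a multiprocessing pool (links and output_path are the variables from the question; the pool size of 4 is an arbitrary choice):

import multiprocessing
import re

import requests

def fetch(link):
    # a plain HTTP fetch; far lighter than driving a real browser
    response = requests.get(link)
    name = re.findall('&LIC=(.+)&SAW', link)[0]
    with open(output_path + name, 'wb') as output:
        output.write(response.content)

pool = multiprocessing.Pool(processes=4)
pool.map(fetch, list(links))
pool.close()
pool.join()

Keep in mind that requests only downloads the raw HTML; if the page builds its content with JavaScript, you still need a real browser.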