I will skip over the usual suggestions concerning adding docstrings to the module and functions, using type hinting and adding comments where they would be useful to the reader and proceed directly to your issue with performance.
You are doing a couple of things that are hurting performance:
- You are taking the URLs you want to process and breaking them up into batches of 1000 that you then submit to process_urls. Each submission results in the creation (or re-creation) of the multithreading pool. Creating threads is less expensive than creating processes, so I can't say that restructuring your code to reuse a single pool would by itself make a huge impact on performance. But for a suggestion I will be making below, having a single, reusable pool is required for best performance. Is there a particular reason why you are even submitting the URLs in batches? If so, I see no reason why you still cannot use a single pool.
- Each URL you submit creates a new Chrome driver. Your submit call references a function crawl_blog_content, which is undefined. I suspect this is supposed to be scrap_blog_content, which is defined (by the way, scrape_blog_content would be a better name, since the verb scrap means to get rid of, and you certainly do not want to do that). Every time you create a new Chrome driver, a new process is created, which is expensive, and the driver has to execute initialization code before it can take your requests. It would be ideal if we could reuse drivers. So if you are creating a pool of 8 threads, you would need to create 8 reusable drivers, i.e. one per thread.
The way to achieve a single, reusable driver per thread is to use a pool initializer function that is invoked once for each thread in the pool before it starts processing submitted tasks. This function creates a Chrome driver and stores it in thread-local storage, which is unique to each thread. The only complication is that when all submitted tasks have completed and the pool is terminated, we would like to call quit on these drivers so that they are shut down instead of lying around even after your script terminates. The way to do that is to enclose the driver in a wrapper class that defines a __del__ method that "quits" the driver when the wrapper is garbage collected; this happens when the thread-local storage is garbage collected, which in turn happens when your thread pool is terminated.
Here is the basic code:
import threading
...
class DriverWrapper:
    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        self.driver = webdriver.Chrome(options=chrome_options)
        self.driver.execute_cdp_cmd('Network.enable', {})
        try:
            self.driver.execute_cdp_cmd('Network.setBlockedURLs', {
                "urls": ["*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp", "*.mp4", "*.avi", "*.mkv", "*.mov"]
            })
        except Exception as e:
            logging.error(f"Error setting blocked URLs: {e}")

    def __del__(self):
        # Invoked when the wrapper is garbage collected, i.e. when the
        # thread-local storage goes away after the pool terminates:
        self.driver.quit()
thread_local = threading.local()

def init_pool():
    # Pool initializer: runs once in each pool thread, so each thread
    # gets its own reusable driver:
    thread_local.driver_wrapper = DriverWrapper()

def get_driver():
    return thread_local.driver_wrapper.driver
So process_urls becomes:
def process_urls(urls):
    results = []
    # Specify a pool initializer:
    with ThreadPoolExecutor(max_workers=8, initializer=init_pool) as executor:
        future_to_url = {executor.submit(scrap_blog_content, url): url for url in urls}
        ...
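The ... above stands for your existing code that iterates over future_to_url and gathers the results; that part does not need to change. For example, if you are collecting results with as_completed (from concurrent.futures), it could still look something like this (a sketch only; adapt it to whatever your original loop does):

        for future in as_completed(future_to_url):
            url = future_to_url[future]
            content = future.result()
            if content is not None:  # scrap_blog_content returns None on error
                results.append((url, content))
    return results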
scrap_blog_content now becomes:
def scrap_blog_content(url):
    # The start_driver function is no longer used; the code has been
    # moved to init_pool:
    driver = get_driver()  # Get the driver from thread local storage
    try:
        driver.get(url)
        ...
    except Exception as e:
        logging.error(f"Error while fetching content from {url}: {e}")
        return None
    # The finally block that quits the driver has been removed
Note that there is no longer a call to driver.quit() in the above function.
Finally, if you can, do not batch the URLs; we would like to invoke process_urls only once so that we do not have to re-create the multithreading pool and thus the Chrome drivers. But if you must create batches, create the pool once in your if __name__ == "__main__": block and pass it to process_urls, as sketched below.
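Here is a rough sketch of what that could look like (init_pool and scrap_blog_content are the functions defined above; url_batches stands in for however you currently build your batches):

from concurrent.futures import ThreadPoolExecutor

def process_urls(executor, urls):
    # The pool is created once by the caller and reused for every batch,
    # so the per-thread drivers are also created only once:
    future_to_url = {executor.submit(scrap_blog_content, url): url for url in urls}
    ...  # collect the results exactly as before

if __name__ == "__main__":
    url_batches = ...  # however you currently build your batches
    with ThreadPoolExecutor(max_workers=8, initializer=init_pool) as executor:
        for batch in url_batches:
            process_urls(executor, batch)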