
I'm scraping a webpage using multithreading and random proxies. My home PC handles this fine with however many worker threads are required (in the current code I've set it to 100); RAM usage seems to hit around 2.5 GB. However, when I run this on my CentOS VPS I get a generic 'Killed' message and the program terminates. With 100 workers I get the Killed error very quickly; when I reduced it to a more reasonable 8 I still got the same error, just after a much longer period. Based on a bit of research I'm assuming the 'Killed' message is related to memory usage. Without multithreading, the error does not occur.

So, what can I do to optimise my code so it still runs quickly but doesn't use so much memory? Is my best bet just to reduce the number of workers even further? And can I monitor my memory usage from within Python while the program is running?

Edit: I just realised my VPS has 256 MB of RAM vs 24 GB on my desktop, which was something I didn't consider when writing the code originally.

import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool  #Thread-based pool

#working_proxies and user_agents are lists defined elsewhere in the script

#Request soup of url, using random proxy / user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        attempts += 1
        if attempts > 30:
            print 'Too many attempts .. something is wrong!'
            sys.exit()
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)  #headers must be passed as a keyword argument
            soup = BeautifulSoup(r.text, "html5lib")  #"html.parser"
            pagination = soup.find("div", class_="pagination").text
            totalpages = int(pagination.split(' of ', 1)[1].split('\n', 1)[0])  #Looks for totalpages to verify proper page load
            currentpage = int(pagination.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000:  #One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print 'Error! Proxy: {}, Error msg: {}'.format(proxy, e)
            pass
    return (soup, totalpages, currentpage)

#Return soup of page of ads, connecting via random proxy/user agent
def scrape_url(url):
    soup, totalpages, currentpage = getsoup(url)               
    #Extract ads from page soup

    ###[A bunch of code to extract individual ads from the page..]

    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages)
    sys.stdout.flush()
    return ads     

def scrapeall():     
    global currentpage, totalpages, offset
    url = "url"

    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60*i) for i in range(totalpages)]

    # Make the pool of workers
    pool = ThreadPool(100)    
    # Open the urls in their own threads and return the results
    results = pool.map(scrape_url, url_list)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()

    flatten_results = [item for sublist in results for item in sublist] #Flattens the list of lists returned by multithreading
    return flatten_results

adscrape = scrapeall() 
  • Most likely with only 256MB RAM the process will be killed for too much memory usage even if it's not multithreaded. You have to keep in mind that not even all of that 256MB is available. Scraping uses a good deal of memory depending on the pages. Commented Feb 27, 2016 at 0:36
  • would you like to queue the requests in a line? Commented Feb 27, 2016 at 0:39
  • Peter, what can I do to reduce memory usage? I have removed multithreading and yes, it still does crash Commented Feb 27, 2016 at 23:31

1 Answer


BeautifulSoup is a pure-Python library, and on a mid-range web site it will eat a lot of memory. If it's an option, try replacing it with lxml, which is faster and written in C. It might still run out of memory if your pages are large, though.
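For example, the pagination check in getsoup() could be done with lxml directly. A rough sketch, with a hypothetical helper name and the proxy/user-agent handling left out, assuming the same "Page X of Y" text inside a div with class pagination as in the question's code:

import requests
from lxml import html

def get_pagination(url):
    # Hypothetical helper: same request as getsoup(), but parsed with lxml's
    # C-based parser instead of BeautifulSoup + html5lib.
    r = requests.get(url, timeout=5)  # proxy/user-agent handling omitted
    tree = html.fromstring(r.content)
    # Same "Page X of Y" text in a div with class "pagination" as in the question
    pagination = tree.xpath("//div[contains(@class, 'pagination')]")[0].text_content()
    currentpage = int(pagination.split('Page ', 1)[1].split(' of', 1)[0])
    totalpages = int(pagination.split(' of ', 1)[1].split('\n', 1)[0])
    return tree, totalpages, currentpage

A smaller intermediate step is to keep BeautifulSoup but pass "lxml" instead of "html5lib" as the parser; html5lib is the slowest and most memory-hungry of BeautifulSoup's parser options.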

As already suggested in the comments, you could use a queue.Queue to store responses. A better version would be to save the responses to disk, store the filenames in a queue, and parse them in a separate process; for that you can use the multiprocessing library. If parsing runs out of memory and gets killed, fetching continues. This pattern is known as fork and die, and it is a common workaround for Python using too much memory.
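A rough sketch of that split, with hypothetical names (fetch_to_disk, parse_worker, scrapeall_lowmem) and the proxy/user-agent and ad-extraction details elided; the fetcher threads only write raw HTML to disk, and parsing happens in a child process:

import os
import tempfile

import requests
from multiprocessing import Process, Queue
from multiprocessing.dummy import Pool as ThreadPool

def fetch_to_disk(url, path_queue):
    # Fetch as in getsoup(), but write the raw HTML to a temp file and hand
    # only the filename to the parser, so the fetching threads stay lightweight.
    r = requests.get(url, timeout=5)              # proxy/user-agent handling omitted
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'w') as f:
        f.write(r.text.encode('utf-8'))
    path_queue.put(path)

def parse_worker(path_queue, result_queue):
    # Runs in a separate process: if parsing is killed for memory,
    # the fetched files are still on disk and can be re-parsed.
    ads = []
    while True:
        path = path_queue.get()
        if path is None:                          # sentinel: no more files
            break
        with open(path) as f:
            ads.extend(parse_ads(f.read()))       # parse_ads = your ad-extraction code
    result_queue.put(ads)

def scrapeall_lowmem(url_list):
    path_queue, result_queue = Queue(), Queue()
    parser = Process(target=parse_worker, args=(path_queue, result_queue))
    parser.start()

    pool = ThreadPool(8)                          # modest number of fetcher threads
    pool.map(lambda u: fetch_to_disk(u, path_queue), url_list)
    pool.close()
    pool.join()

    path_queue.put(None)                          # tell the parser we're done
    ads = result_queue.get()                      # collect results before joining
    parser.join()
    return ads

The queue only ever holds small strings (file paths), while the bulky HTML lives on disk until the parser process needs it.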

Then you also need a way to see which responses failed to parse.
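As for the question's last point, monitoring memory usage from within Python: a minimal sketch using the standard-library resource module (on Linux, ru_maxrss is reported in kilobytes; log_memory is a hypothetical helper name):

import resource

def log_memory(label=''):
    # Peak resident set size of this process; kilobytes on Linux
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'Peak memory {}: {:.1f} MB'.format(label, peak_kb / 1024.0)

Calling it at the start of scrape_url(), for example, would show how memory grows as pages are fetched and parsed.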
