2

I am currently writing a simple crawler in Python 2.7 using urllib2. Here is the downloader class:

import urllib2

class Downloader:
    def __init__(self, limit = 3):
        self.limit = limit

    def downloadGet(self, url):
        request = urllib2.Request(url)
        retry = 0
        succ = False
        page = None

        while retry < self.limit:
            print "Retry: " + str(retry) + " Limit:" + str(self.limit)
            try:
                response = urllib2.urlopen(request)
                page = response.read()
                succ = True
                break
            except:
                retry += 1

        return succ, page

Every URL is tried up to three times. Multithreading is also used; the thread code is as follows:

from threading import Thread

class DownloadThread(Thread):
    def __init__(self, requestGet, limit):
        Thread.__init__(self)
        self.requestGet = requestGet
        self.downloader = Downloader(limit)

    def run(self):
        while True:
            url = self.requestGet()
            if url is None:
                break

            ret = self.download(url)
            print ret

    def download(self, url):
        # some other stuff
        succ, flv = self.downloader.downloadGet(url)
        return succ

However, in experiments with 5 threads, the downloader does not stop after 3 attempts. The output even shows "Retry: 4280 Limit:3" for some threads. It seems the while condition is being ignored.

Any help or suggestions are welcome. Thank you!

3
  • Could you show the code that creates the DownloadThread instance? Commented Nov 6, 2013 at 10:53
  • Is it possible you are reading the limit from the command line without converting it to an int first? If limit is actually "3" and not the integer 3, you would get behavior like this, e.g., 4280 < "3" is True. Commented Nov 6, 2013 at 10:56
  • @Constantine Thanks! That's where the trick lies. I did forget to convert the "limit" parameter after reading it from a file. Commented Nov 6, 2013 at 11:13

3 Answers

5

One possible cause of the infinite loop in downloadGet: limit is a string object.

If limit is a string, retry < self.limit always evaluates to True in CPython 2.x, because an integer compares as smaller than any string:

>>> retry = 4280
>>> limit = '3'
>>> retry < limit
True

Check the type of the limit passed.
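
As a minimal sketch of that check, the value can be converted once, at the point where it is read. The helper name parse_limit and its default argument are hypothetical, not part of the original code; the snippet is written to run under both Python 2.7 and Python 3.

```python
def parse_limit(raw, default=3):
    """Convert a retry limit read from a file or command line to a positive int.

    parse_limit and default are hypothetical names used for illustration only.
    """
    if raw is None:
        return default
    limit = int(str(raw).strip())  # raises ValueError for input such as "three"
    if limit < 1:
        raise ValueError("retry limit must be at least 1, got %d" % limit)
    return limit


# A value read from a config file arrives as text, e.g. "3\n".
limit = parse_limit("3\n")
retry = 4280
print(retry < limit)  # int compared with int, so this prints False
```

Under Python 3 the original mistake would surface on its own, since comparing an int with a str raises TypeError instead of returning True.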

0

You don't have anything in your DownloadThread code to break out of the while loop if URL is not empty.

1 Comment

I think that's not the problem. In the next iteration, url might become empty.
0

You should define your loop in a more Pythonic fashion:

def downloadGet(self, url):
    ...
    # do not declare retry before this
    for retry in xrange(self.limit):
        ...
        try:
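
For reference, here is one way the for-loop version could be completed. This is only a sketch: download_with_retries, fetch, and flaky_fetch are hypothetical names, and fetch stands in for the urllib2 call so the example runs without a network. It catches IOError, which is the base class of urllib2.URLError.

```python
try:
    range_ = xrange  # Python 2
except NameError:
    range_ = range   # Python 3


def download_with_retries(fetch, url, limit):
    """Call fetch(url) up to `limit` times and return (succ, page)."""
    for attempt in range_(limit):  # raises TypeError at once if limit is a str
        try:
            return True, fetch(url)
        except IOError:
            continue  # this attempt failed; move on to the next one
    return False, None


# Hypothetical stand-in for a download that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated network failure")
    return "<html>ok</html>"


print(download_with_retries(flaky_fetch, "http://example.com/", 3))
```

A side benefit: if limit is accidentally a string, xrange (or range) raises TypeError immediately instead of looping forever.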

EDIT:

Alternatively, you could let the while condition carry the loop state instead of relying on break (although I feel my first example is less fragile):

def downloadGet(self, url):
    ...
    while retry < self.limit and not succ:
        ...

This has the benefit of being more self-documenting.

That said, I would consider moving the retry loop into DownloadThread.download instead of Downloader.downloadGet. Something like this:

class DownloadThread(Thread):
    ...
    def download(self, url):
        for retry in xrange(self.downloader.limit):
            succ, flv = self.downloader.downloadGet(url)
            if succ:
                return succ
        # every attempt failed
        return False


class Downloader(object):
    ...
    def downloadGet(self, url):
        request = urllib2.Request(url)
        try:
            response = urllib2.urlopen(request)
            page = response.read()

        # always qualify your exception handlers
        # or you may be masking errors you don't know about
        # (URLError also covers HTTPError, which is its subclass)
        except urllib2.URLError:
            return False, None

        return True, page
