
I am using python-requests library to do my requests.

On the home page of the website, I get a bunch of images and show them to the user. Sometimes those images get deleted and I get a broken image url.

So I want to check whether images exist.

Here is what I did:

import requests

items = Item.objects.filter(shop__is_hidden=False, is_hidden=False).order_by("?")[:16]

existing_items = []

for item in items:
    response = requests.head(item.item_low_url)
    if response.status_code == 200:
        existing_items.append(item)

But it is taking a little longer than I want.

Is there any faster way?

1 Answer
Your requests are blocking and synchronous, which is why they take so long. In simple terms, the second request doesn't start until the first one finishes.

Think of it like one conveyor belt with a bunch of boxes and a single worker to process them.

The worker can only process one box at a time, and he has to wait for the processing to be done before he can start on another box (in other words, he cannot take a box from the belt, drop it somewhere to be processed, come back, and pick up another one).

To reduce the time it takes to process the boxes, you can:

  1. Reduce the time it takes to process each box.
  2. Make it so that multiple boxes can be processed at the same time (in other words, the worker doesn't have to wait).
  3. Increase the number of belts and workers and then divide the boxes between belts.

We really can't do #1 because this delay is from the network (you could reduce the timeout period, but this is not recommended).
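For completeness, capping the per-request wait looks like this with `requests` (the helper name `head_with_timeout` is mine, not from the question); it fails fast instead of hanging on a dead host, but it won't make healthy requests any quicker:

```python
import requests

def head_with_timeout(url, timeout=2.0):
    # Give up after `timeout` seconds instead of waiting indefinitely;
    # treat connection errors and timeouts the same as a missing image.
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False
```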

Instead, what we want to do is #2: since each box is processed independently, we don't need to wait for one box to finish before starting on the next.

So we want to do the following:

  1. Quickly send multiple requests to a server for URLs at the same time.
  2. Wait for each of them to finish (independent of each other).
  3. Collect the results.

There are multiple ways to do this, which are listed in the documentation for requests; here is an example using grequests:

import grequests

# Create a map between url and the item
url_to_item = {item.item_low_url: item for item in items}

# Create a request queue, but don't send them
rq = (grequests.head(url) for url in url_to_item.keys())

# Send requests simultaneously, and collect the results,
# and filter those that are valid

# Each item returned is a Response object, whose request
# attribute is the original request this is a response to;
# we use that to map back to the original Item objects.
# (grequests.map returns None for requests that failed outright,
# so filter those out as well.)

results = [url_to_item[r.request.url]
           for r in grequests.map(rq)
           if r is not None and r.status_code == 200]
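If you'd rather not add a dependency, the same pattern can be sketched with the standard library's concurrent.futures; the names `url_ok` and `filter_existing` are mine, not part of the question:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def url_ok(url, timeout=5):
    """Return True if a HEAD request for url comes back with HTTP 200."""
    import requests  # imported here so any checker can be swapped in below
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def filter_existing(items, get_url, check=url_ok, max_workers=16):
    """Run check() on every item's URL in a thread pool; keep those that pass."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check, get_url(item)): item for item in items}
        return [futures[f] for f in as_completed(futures) if f.result()]

# Hypothetical usage with the queryset from the question:
# existing_items = filter_existing(items, lambda i: i.item_low_url)
```

Because the checks run in threads, the total wait is roughly that of the slowest single request rather than the sum of all of them.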

2 Comments

Thanks for your comprehensive answer. It is definitely faster now, but the results turn out to be a list of Response objects, not Items, which is surprising. Any ideas on how to fix this?
The reason it's a bunch of Response objects is that that's the return value of grequests.map(rq). See the update for a way to map each response back to the original item.
