
I am using python-requests library to do my requests.

On the home page of the website, I get a bunch of images and show them to the user. Sometimes those images get deleted and I get a broken image url.

So I want to check whether images exist.

Here is what I did:

import requests

items = Item.objects.filter(shop__is_hidden=False, is_hidden=False).order_by("?")[:16]

existing_items = []

for item in items:
    response = requests.head(item.item_low_url)
    if response.status_code == 200:
        existing_items.append(item)

But it is taking a little longer than I want.

Is there any faster way?

1 Answer
Your requests are blocking and synchronous, which is why they take so long. In simple terms, the second request doesn't start until the first one finishes.

Think of it like one conveyor belt with a bunch of boxes and a single worker to process them.

The worker can only process one box at a time, and he has to wait for the processing to be done before he can start on another box (in other words, he cannot take a box from the belt, drop it somewhere to be processed, come back, and pick up another one).

To reduce the time it takes to process the boxes, you can:

  1. Reduce the time it takes to process each box.
  2. Make it so that multiple boxes can be processed at the same time (in other words, the worker doesn't have to wait).
  3. Increase the number of belts and workers and then divide the boxes between belts.

We really can't do #1 because this delay is from the network (you could reduce the timeout period, but this is not recommended).
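For completeness, capping the per-request wait looks like this with `requests` (the helper name `head_with_timeout` is mine, not from the question); it fails fast instead of hanging on a dead host, but it won't make healthy requests any quicker:

```python
import requests

def head_with_timeout(url, timeout=2.0):
    # Give up after `timeout` seconds instead of waiting indefinitely;
    # treat connection errors and timeouts the same as a missing image.
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False
```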

Instead, what we want to do is #2: since each box is processed independently, we don't need to wait for one box to finish before starting on the next.

So we want to do the following:

  1. Quickly send multiple requests to a server for URLs at the same time.
  2. Wait for each of them to finish (independent of each other).
  3. Collect the results.

There are multiple ways to do this, which are listed in the documentation for requests; here is an example using grequests:

import grequests

# Create a map between url and the item
url_to_item = {item.item_low_url: item for item in items}

# Create a request queue, but don't send them
rq = (grequests.head(url) for url in url_to_item.keys())

# Send requests simultaneously, and collect the results,
# and filter those that are valid

# Each item returned is a Response object, whose request
# attribute is the original request this is a response to;
# we use that to map back to the original Item objects.
# (grequests.map returns None for requests that failed outright,
# so filter those out as well.)

results = [url_to_item[r.request.url]
           for r in grequests.map(rq)
           if r is not None and r.status_code == 200]
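If you'd rather not add a dependency, the same pattern can be sketched with the standard library's concurrent.futures; the names `url_ok` and `filter_existing` are mine, not part of the question:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def url_ok(url, timeout=5):
    """Return True if a HEAD request for url comes back with HTTP 200."""
    import requests  # imported here so any checker can be swapped in below
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def filter_existing(items, get_url, check=url_ok, max_workers=16):
    """Run check() on every item's URL in a thread pool; keep those that pass."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check, get_url(item)): item for item in items}
        return [futures[f] for f in as_completed(futures) if f.result()]

# Hypothetical usage with the queryset from the question:
# existing_items = filter_existing(items, lambda i: i.item_low_url)
```

Because the checks run in threads, the total wait is roughly that of the slowest single request rather than the sum of all of them.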

2 Comments

Thanks for your comprehensive answer. It is definitely faster now, but the results turn out to be a list of Response objects, not Items, which is surprising. Any ideas on how to fix this?
The reason it's a bunch of Response objects is that that's the return value of grequests.map(rq). See the update for a way to map each response back to the original item.
