48

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.

1 Comment

"How can I %s" % title

8 Answers

91

Here is some code to download all the images from the supplied URL, and save them in the specified output folder. You can modify it to your own needs.

"""
dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output folder ("/test/" by default)

Usage:
    python dumpimages.py http://example.com/ [output]
"""
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen, urlretrieve
import os
import sys

def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url), "html.parser")
    parsed = list(urlparse(url))

    for image in soup.findAll("img"):
        print("Image: %(src)s" % image)
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlunparse(parsed), outpath)

def _usage():
    print("usage: python dumpimages.py http://example.com [outpath]")

if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)

Edit: You can specify the output folder now.


3 Comments

open(..).write(urlopen(..).read()) could be replaced by urllib.urlretrieve()
Your code fails if image locations are specified relative to the HTML document. Can you please include the fix provided by unutbu in case someone uses your script in the future?
@NiklasB. I encountered the same problem. I ended up just using a regexp to find all image links, which is more reliable than BeautifulSoup in my opinion.
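For reference, a rough sketch of that regex approach (Python 3, placeholder URL); note that an HTML parser is generally more robust against messy markup:

import re
from urllib.request import urlopen

# Placeholder URL; the regex only catches straightforward src="..." attributes
html = urlopen("http://example.com/").read().decode("utf-8", "ignore")
img_srcs = re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html)
print(img_srcs)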
14

Ryan's solution is good, but fails if the image source URLs are absolute URLs or anything that doesn't give a good result when simply concatenated to the main page URL. urljoin recognizes absolute vs. relative URLs, so replace the loop in the middle with:

from urllib.parse import urljoin

for image in soup.findAll("img"):
    print("Image: %(src)s" % image)
    image_url = urljoin(url, image["src"])
    filename = image["src"].split("/")[-1]
    outpath = os.path.join(out_folder, filename)
    urlretrieve(image_url, outpath)

Comments

9

You have to download the page, parse the HTML document, find your img tags, and download the images. You can use urllib2 (urllib.request in Python 3) for downloading and Beautiful Soup for parsing the HTML file.
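A minimal sketch of that approach in Python 3 (urllib.request replaces urllib2; the URL and output folder below are placeholders):

import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

page_url = "http://example.com/"          # placeholder URL
soup = BeautifulSoup(urlopen(page_url), "html.parser")

os.makedirs("images", exist_ok=True)      # placeholder output folder
for img in soup.find_all("img", src=True):
    img_url = urljoin(page_url, img["src"])
    filename = os.path.basename(img["src"]) or "image"
    urlretrieve(img_url, os.path.join("images", filename))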

Comments

9

And this is a function for downloading one image:

import urllib.request

def download_photo(self, img_url, filename):
    # Written as a method; DOWNLOADED_IMAGE_PATH is assumed to be defined elsewhere
    file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)
    downloaded_image = open(file_path, "wb")

    image_on_web = urllib.request.urlopen(img_url)
    while True:
        buf = image_on_web.read(65536)
        if len(buf) == 0:
            break
        downloaded_image.write(buf)
    downloaded_image.close()
    image_on_web.close()

    return file_path

1 Comment

works fine for me when removing the while loop (not its content!)
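In other words, for small images the chunked loop can be replaced by a single read; a sketch of the equivalent body:

    # Equivalent without the chunked loop: read the whole image at once.
    # The loop variant just keeps memory usage low for large files.
    downloaded_image.write(image_on_web.read())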
3

Use htmllib to extract all img tags (override do_img), then use urllib2 to download all the images.
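htmllib only exists in Python 2; a rough stdlib-only sketch of the same idea on Python 3, using html.parser in place of htmllib and urllib.request in place of urllib2 (the URL is a placeholder):

import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

class ImgCollector(HTMLParser):
    """Collects the src of every img tag, analogous to overriding do_img."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

page_url = "http://example.com/"  # placeholder URL
parser = ImgCollector()
parser.feed(urlopen(page_url).read().decode("utf-8", "ignore"))
for src in parser.srcs:
    urlretrieve(urljoin(page_url, src), os.path.basename(src) or "image")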

2 Comments

This assumes well-formed HTML, whereas Beautiful Soup can cope with broken markup.
On the other hand, this is using only standard library modules.
1

If the request needs authorization, refer to this one:

import requests

r_img = requests.get(img_url, auth=(username, password))
with open('000000.jpg', 'wb') as f:
    f.write(r_img.content)

Comments

1

Based on code here

With some lines removed, it downloads only the images referenced by img tags.

Uses Python 3, Requests, BeautifulSoup, and a few standard-library modules.

import os, sys
import requests
from urllib import parse
from bs4 import BeautifulSoup
import re
def savePageImages(url, imagespath='images'):
    def soupfindnSave(pagefolder, tag2find='img', inner='src'):
        if not os.path.exists(pagefolder): # create only once
            os.mkdir(pagefolder)
        for res in soup.findAll(tag2find):
            if res.has_attr(inner): # the inner attribute (e.g. src) must exist
                try:
                    filename, ext = os.path.splitext(os.path.basename(res[inner])) # get name and extension
                    filename = re.sub(r'\W+', '', filename) + ext # strip special chars from the name
                    fileurl = parse.urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    if not os.path.isfile(filepath): # was not downloaded
                        with open(filepath, 'wb') as file:
                            filebin = session.get(fileurl)
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)   
    session = requests.Session()
    #... whatever other requests config you need here
    response = session.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    soupfindnSave(imagespath, 'img', 'src')

Use it like this to save the google.com page images in a folder named google_images:

savePageImages('https://www.google.com', 'google_images')

Comments

1
import urllib.request as req

# image_link is the image URL; image_location is the destination file path
with req.urlopen(image_link) as d, open(image_location, "wb") as image_object:
    data = d.read()
    image_object.write(data)

