
I have a lot of image files in a folder (5M+). The images are all of different sizes, and I want to resize them to 128x128.

I used the following function in a loop to resize the images in Python with OpenCV:

import glob
import cv2
from tqdm import tqdm

def read_image(img_path):
    img = cv2.imread(img_path)
    img = cv2.resize(img, (128, 128))
    return img

for file in tqdm(glob.glob('train-images//*.jpg')):
    img = read_image(file)
    cv2.imwrite(file, img)  # overwrite the original file

But it will take more than 7 hours to complete. I was wondering whether there is any way to speed up this process.

Can I use parallel processing to do this efficiently, with Dask or something similar? If so, how?

2 Comments
  • just to make sure, you're intending to overwrite the originals? Commented Nov 4, 2018 at 6:02
  • @Aaron, Yes, I am trying to overwrite the files. Commented Nov 4, 2018 at 6:40

2 Answers


If you are absolutely intent on doing this in Python, then please just disregard my answer. If you are interested in getting the job done simply and fast, read on...

I would suggest GNU Parallel if you have lots of things to be done in parallel and even more so as CPUs become "fatter" with more cores rather than "taller" with higher clock rates (GHz).

At its simplest, you can use ImageMagick just from the command line in Linux, macOS and Windows like this to resize a bunch of images:

magick mogrify -resize 128x128\! *.jpg

If you have hundreds of images, you would do better to run that in parallel, which would be:

parallel magick mogrify -resize 128x128\! ::: *.jpg

If you have millions of images, the expansion of *.jpg will overflow your shell's command buffer, so you can use the following to feed the image names in on stdin instead of passing them as parameters:

find . -iname \*.jpg -print0 | parallel -0 -X --eta magick mogrify -resize 128x128\!

There are two "tricks" here:

  • I use find ... -print0 along with parallel -0 to null-terminate filenames so there are no problems with spaces in them,

  • I use parallel -X which means, rather than start a whole new mogrify process for each image, GNU Parallel works out how many filenames mogrify can accept, and gives it that many in batches.

I commend both tools to you.


Whilst the ImageMagick aspects of the above answer work on Windows, I don't use Windows and I am unsure about using GNU Parallel there. I think it may run under git-bash and/or Cygwin - you could try asking a separate question - they are free!

As regards the ImageMagick part, I think you can get a listing of all the JPEG filenames in a file using this command:

DIR /S /B *.JPG > filenames.txt

You can then probably process them (not in parallel) like this:

magick mogrify -resize 128x128\! @filenames.txt

And if you find out how to run GNU Parallel on Windows, you can probably process them in parallel using something like this:

parallel --eta -a filenames.txt magick mogrify -resize 128x128\!

2 Comments

I have added what little I know about Windows.
GNU Parallel is tested on both git-bash and CygWin at least once per year. Basic functionality works. File a bug report if advanced functionality does not.

If these images are stored on a magnetic hard drive, you may very well find you are limited by read/write speed (lots of small reads and writes are very slow on spinning magnetic disks).
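
A quick, rough way to check whether you are I/O-bound is to time the reads and the resizes separately on a small sample. This is only a sketch: it assumes the question's train-images folder, and the sample size of 200 is arbitrary.

import time, glob, cv2

files = glob.glob('train-images/*.jpg')[:200]  # small arbitrary sample

t0 = time.perf_counter()
imgs = [cv2.imread(f) for f in files]  # disk reads plus JPEG decode
t_read = time.perf_counter() - t0

t0 = time.perf_counter()
for im in imgs:
    if im is not None:
        cv2.resize(im, (128, 128))  # pure CPU work
t_resize = time.perf_counter() - t0

print('read: {:.2f}s  resize: {:.2f}s'.format(t_read, t_resize))
# if t_read dwarfs t_resize, extra cores won't help much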

Otherwise you can always throw the problem at a processing pool to utilize multiple cores:

from multiprocessing.dummy import Pool
from multiprocessing.sharedctypes import Value
from ctypes import c_int
import time, cv2, os

wdir = r'C:\folder full of large images'
os.chdir(wdir)

def read_imagecv2(img_path, counter):
    img = cv2.imread(img_path)
    if img is None:  # skip unreadable/corrupt files rather than crashing a worker
        return
    img = cv2.resize(img, (128, 128))
    cv2.imwrite('resized_' + img_path, img)  # write the image in the worker (I didn't want to overwrite my images)
    with counter.get_lock():  # pools give no way to check up on progress, so we make our own counter
        counter.value += 1

if __name__ == '__main__':
    # start 4 workers (threads, since we're using multiprocessing.dummy)
    with Pool(processes=4) as pool:  # this should match your processor core count (or be less)
        counter = Value(c_int, 0)  # using sharedctypes with mp.dummy isn't needed anymore, but we already wrote the code once...
        chunksize = 4  # making this larger might improve speed (less important the longer a single call takes)
        result = pool.starmap_async(read_imagecv2,  # function to send to the worker pool
                                    ((file, counter) for file in os.listdir(os.getcwd()) if file.endswith('.jpg')),  # generator to fill in function args
                                    chunksize)  # how many jobs to submit to each worker at once
        while not result.ready():  # print progress to show the program is still working
            # you could take counter.get_lock() here, but we only read the value,
            # so nothing bad happens if a write occurs at the same time;
            # just don't time.sleep() while holding the lock
            print("\rcompleted {} images   ".format(counter.value), end='')
            time.sleep(.5)
        print('\nCompleted all images')

Due to a somewhat known problem with cv2 not playing nice with multiprocessing, we can use threads instead of processes by replacing multiprocessing.Pool with multiprocessing.dummy.Pool. Many OpenCV functions release the GIL anyway, so we should still see the computational benefit of using multiple cores at once. This also trims some overhead, as threads aren't as heavy as processes. After some investigating, I have not found an image library that does play nicely with processes: they all seem to fail when trying to pickle a function to send to the child processes (which is how items of work are sent to the child processes for computation).
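
Since the question specifically asks about Dask: the same thread-based idea maps naturally onto dask.bag. Below is a minimal sketch, not a tested-at-scale solution; it assumes dask is installed and the question's train-images layout, and resize_inplace is a hypothetical helper that overwrites the originals as the asker intends.

import glob, cv2
import dask.bag as db

def resize_inplace(path):  # hypothetical helper: overwrites the original file
    img = cv2.imread(path)
    if img is None:  # skip unreadable files rather than crashing
        return False
    return cv2.imwrite(path, cv2.resize(img, (128, 128)))

files = glob.glob('train-images/*.jpg')
bag = db.from_sequence(files, npartitions=64)  # split the work into chunks
results = bag.map(resize_inplace).compute(scheduler='threads')  # threads, for the same GIL reasons as above
print('resized {} of {} images'.format(sum(results), len(results)))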

9 Comments

I will try and let you know the results
Code ran without any error; it showed the tqdm progress bar, but it was at 100% within a second, and as I suspected the files were not resized
@SreeramTP It seems this is a known problem: forums.fast.ai/t/… Given this, I'll re-write for threads rather than processes (easier anyway, though it usually doesn't allow much performance gain because of the GIL).
I will try running the code and let you know the results
@SreeramTP I might also suggest replacing opencv with something like skimage or PIL. They each have their advantages, and you may find one works better here than another.