
This code adds Gaussian noise to a photo. There is a speed problem: a Full HD photo is processed in about 0.45 seconds. That is unacceptable for my task; I need to get down to a few milliseconds.

import numpy as np
import cv2

image = cv2.imread('1.jpg')

row, col, ch = image.shape
mean = 0
var = 0.1
sigma = var ** 0.5
gauss = np.random.normal(mean, sigma, (row, col, ch))
gauss = gauss.reshape(row, col, ch)
noisy = image + gauss

cv2.imwrite('2.jpg', noisy)

I have already optimized the slowest part of the code, generating the array of random numbers, which takes about 0.32 s:

gauss = np.random.normal(mean, sigma, (row, col, ch))
gauss = gauss.reshape(row, col, ch)

I generated a matrix one hundredth of the size and then tiled it a hundred times:

roww = row // 100
gauss = np.random.normal(mean, sigma, (roww, col, ch))
gauss = gauss.reshape(roww * col * ch)
gauss = np.tile(gauss, 100)
gauss = gauss.reshape(row, col, ch)

The code above takes 20 ms, most of which (18 ms) is spent tiling the small matrix into the full-size one:

gauss = np.tile(gauss, 100)

How could you make this operation faster?

And now the main problem: the full code still takes far too long (170 ms). The most time-consuming operations are:

Adding the matrices takes 30 ms:

noisy = image + gauss

Opening the photo takes 35 ms:

image = cv2.imread('1.jpg')

and saving it takes 90 ms:

cv2.imwrite('2.jpg', noisy)

Is it possible to speed up these operations in any way in Python? Thanks!

Full code:

import numpy as np
import cv2

image = cv2.imread('1.jpg')

row, col, ch = image.shape
mean = 0
sigma = 10
roww = row // 100
# Generate a small noise block and tile it up to the full image size
gauss = np.random.normal(mean, sigma, (roww, col, ch))
gauss = gauss.reshape(roww * col * ch)
gauss = np.tile(gauss, 100)
gauss = gauss.reshape(row, col, ch)
noisy = image + gauss

cv2.imwrite('2.jpg', noisy)
Try the numba library; it's compatible with numpy but adds JIT compilation and the ability to run CUDA code, which might make it faster. Also see if you could pre-generate a few of those random matrices and keep them around in RAM instead of making a new one every time. Commented Dec 17, 2021 at 9:40

1 Answer


The first code is bounded by the time needed to generate the random numbers. This is generally a slow operation (whatever the language, although there are tricks to speed it up in low-level native code), so there is not much to do in Numpy. You can use Numba to parallelize this operation, but note that the current random number generator of Numba is a bit slower than Numpy's in sequential code.
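
Here is a minimal sketch of what that parallelization could look like (my own illustration, not benchmarked; the function name is hypothetical):

import numba as nb
import numpy as np

# Sketch: fill the noise array one row per parallel thread. In parallel
# mode Numba keeps a per-thread RNG state, so drawing inside prange is safe.
@nb.njit(parallel=True)
def parallel_normal(row, col, ch, mean, sigma):
    out = np.empty((row, col, ch), dtype=np.float64)
    for i in nb.prange(row):
        for j in range(col):
            for c in range(ch):
                out[i, j, c] = np.random.normal(mean, sigma)
    return out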

The read/write operations should be bounded by your storage device and by the speed of encoding/decoding. For the former, you can use an in-RAM virtual device, but if you cannot control that then there is nothing to do (apart from using faster hardware like an NVMe SSD). For the latter, the Python wrapper of OpenCV already uses a highly optimized JPEG encoder/decoder that should already be fast for this operation. So you cannot speed up this part itself, but you can run several of these operations in parallel, as sketched below.
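
For instance, a minimal sketch of overlapping several reads/writes with a thread pool (the file names and the process_file helper are illustrative; OpenCV releases the GIL inside imread/imwrite, so Python threads can genuinely overlap this work):

from concurrent.futures import ThreadPoolExecutor
import cv2

paths = ['1.jpg', '2.jpg', '3.jpg']  # illustrative file names

def process_file(path):
    image = cv2.imread(path)             # decode (releases the GIL)
    noisy = image                        # placeholder for the noise step
    cv2.imwrite('noisy_' + path, noisy)  # encode (releases the GIL)

with ThreadPoolExecutor() as pool:
    list(pool.map(process_file, paths))  # handle the files concurrently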

Regarding the Numpy code, there are two main issues:

First, np.random.normal generates an array of 64-bit floating-point numbers (float64), while the image array contains only 8-bit integers (uint8). Working on float64 values is much more expensive than on uint8 ones (up to one order of magnitude slower in the worst case). Unfortunately, there is no (simple) way to generate random integers with a normal distribution, as this distribution is tightly bound to real numbers, and Numpy also lacks a parameter to work in float32. Your idea of reusing numbers is quite good for improving performance. Still, adding a uint8 array to a float64 one is expensive, as Numpy converts the former to float64 and then produces a new float64 array. You could convert the random array to uint8 in the first place, but this is not so easy in practice: the negative values cannot be converted to correct uint8 ones, and even if they could, the addition would likely cause overflows. Note that Numba can help to further speed up this part, as it can convert the float64 values to uint8 on the fly (and in parallel).
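
If you want to stay in pure Numpy despite this, a minimal sketch (my own variant, not the method below; it reuses the variables from the question's code) is to widen to int16, add, then clip back into range:

# Widen to int16 so negative noise values and overflows survive the
# addition, then clip back into the valid uint8 range.
noise = np.random.normal(mean, sigma, (roww, col, ch)).round().astype(np.int16)
noisy = image.reshape(-1, roww, col, ch).astype(np.int16) + noise
noisy = noisy.clip(0, 255).astype(np.uint8).reshape(row, col, ch)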

Moreover, np.tile should theoretically not copy the array, but it sadly does make a copy here. Fortunately, you can remove this expensive copy using broadcasting.

Here is the resulting code:

row, col, ch = image.shape
mean = 0
sigma = 10
roww = row // 100
gauss = np.random.normal(mean, sigma, (roww, col, ch))
# View the image as stacked slabs of shape (roww, col, ch) so the small
# noise array broadcasts over them without a tiled copy.
noisy = (image.reshape(-1, roww, col, ch) + gauss.astype(np.uint8)).reshape(row, col, ch)
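
Note that the gauss.astype(np.uint8) cast wraps negative samples around; this is exactly the overflow issue described above, so this version trades some correctness for speed.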

I advise you to perform the whole operation using Numba:

import numba as nb

# The explicit signature compiles the function for C-contiguous uint8
# 3D arrays as soon as it is defined.
@nb.njit('uint8[:,:,::1](uint8[:,:,::1])', parallel=True)
def compute(image):
    row, col, ch = image.shape
    mean = 0
    sigma = 10
    out = np.empty_like(image)
    roww = row // 100
    gauss = np.random.normal(mean, sigma, (roww, col, ch))
    for i in nb.prange(row):
        iWrap = i % roww  # reuse the small noise block every roww rows
        for j in range(col):
            for c in range(ch):
                rnd = gauss[iWrap, j, c]
                intRnd = int(np.round(rnd))
                noisedInt = int(image[i, j, c]) + intRnd
                # Clamp to the valid uint8 range to avoid overflows
                clampedNoisedInt = min(max(noisedInt, 0), 255)
                out[i, j, c] = clampedNoisedInt
    return out
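
A quick usage sketch (file names are illustrative; thanks to the explicit signature, compilation happens at definition time, so the call itself runs at full speed):

image = cv2.imread('1.jpg')
noisy = compute(image)   # returns a ready-to-write uint8 array
cv2.imwrite('2.jpg', noisy)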

Here are the timings on my 6-core machine on a 1920x1080x3 image (without including the time to read/write the image):

Initial:              23.2 ms
Optimized with Numpy: 17.9 ms
Optimized with Numba:  4.3 ms

If this is not fast enough, then you need to rewrite this operation in C using fast low-level SIMD intrinsics and multiple threads.

The read and write take about 21 ms and 42 ms respectively on my machine (which is actually good, since the input/output images are heavily compressed).


5 Comments

Thank you. But my first option works a little oddly (some artifacts appear), and the second option with Numba takes the same time as the first one. imgur.com/a/jn2gyiw
Note that roww was bigger in the Numba code, making it slower than it should be (I changed the divisor because my test image was not divisible by 100). Additionally, you can replace ch by 3 in the code if you are sure the image contains 3 channels (you can add an assert ch == 3 to be safe). This helps Numba unroll the loop, making it significantly faster.
I noticed that as sigma increases, the size of the output file also grows. For example, with sigma=10 the size almost doubles. Could you suggest what this might be related to?
@SergioX13 This is weird, since sigma is only used in np.random.normal, which should only change the standard deviation of the normal distribution and not the array size. So I guess you use sigma somewhere else in your code and that causes this effect. This is especially true for the Numba code, where the size of out is initialized from image.
This is true for the first option too; sigma is not used anywhere else.
