
I'm trying to optimize the performance of my Python program and I think I have identified this piece of code as the bottleneck:

  for i in range(len(green_list)):
    rgb_list = []

    for j in range(len(green_list[i])):
      rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))

    write_file(str(i), rgb_list)

Where red_list, green_list and blue_list are numpy arrays with values like this:

red_list = [[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]]

green_list = [[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]]

blue_list = [[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]]

At the end of each execution of the inner loop, rgb_list contains the hex values:

rgb_list = ['01060b', '02070c', '03080d', '04090e', '050a0f']

Now, it is not clear to me how to exploit the potential of numpy arrays, but I think there is a way to optimize those two nested loops. Any suggestions?

  • Can you summarize the goal and give the expected output? Commented Jun 16, 2022 at 7:31
  • Sorry, but your rgb_list essentially gets reset to an empty list for each cycle of the first loop. Is that intended? Commented Jun 16, 2022 at 7:41
  • I'd suggest you come up with a more self-contained piece of code with clear input and output. For example, the write_file() could be replaced with just saving the result of the outer loop to a list that lives outside of it. Eventually, I would wrap it in a function. Commented Jun 16, 2022 at 7:52
  • The question title mentions numpy, but there is no numpy array. Is each list a 2D numpy array with the same row length? Commented Jun 16, 2022 at 7:56
  • That input is just a trivial example. The arrays will contain different values and more than 5 elements (potentially millions). Commented Jun 16, 2022 at 8:18

4 Answers


I assume the essential traits of your code could be summarized in the following generator:

import numpy as np


def as_str_OP(r_arr, g_arr, b_arr):
    n, m = r_arr.shape
    for i in range(n):
        rgb = []
        for j in range(m):
            rgb.append('%02x%02x%02x' % (r_arr[i, j], g_arr[i, j], b_arr[i, j]))
        yield rgb

which can be consumed with a for loop, for example to write to disk:

for x in as_str_OP(r_arr, g_arr, b_arr):
    write_to_disk(x)

The generator itself can be written either with the core computation vectorized in NumPy or in a Numba-friendly way. The key is to replace the relatively slow string interpolation with a custom int-to-hex conversion.

This results in substantial speed-up, especially as the size of the input grows (and particularly the second dimension).
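
To illustrate the idea on a single value (a minimal sketch, not part of the code below):

    value = 0x6b  # 107
    hi, lo = divmod(value, 16)  # high and low hex digits: 6 and 11
    # digits 0-9 map to ASCII '0'-'9' (offset 48); 10-15 map to 'a'-'f' (offset 87)
    print(chr(hi + (48 if hi < 10 else 87)) + chr(lo + (48 if lo < 10 else 87)))
    # 6b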

Below is the NumPy-vectorized version:

def hex_to_ascii_np(x):
    # NumPy-friendly counterpart of the Numba hex_to_ascii() below: maps hex
    # digit values 0-15 to the ASCII codes of '0'-'9' and 'a'-'f', elementwise
    return np.where(x < 10, x + 48, x + 87)


def as_str_np(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        r0, r1 = divmod(r_arr[i, :], 16)
        g0, g1 = divmod(g_arr[i, :], 16)
        b0, b1 = divmod(b_arr[i, :], 16)
        rgb[:, 0] = hex_to_ascii_np(r0)
        rgb[:, 1] = hex_to_ascii_np(r1)
        rgb[:, 2] = hex_to_ascii_np(g0)
        rgb[:, 3] = hex_to_ascii_np(g1)
        rgb[:, 4] = hex_to_ascii_np(b0)
        rgb[:, 5] = hex_to_ascii_np(b1)
        yield rgb.view(f'<U{2 * l}').reshape(m).tolist()

and the Numba-accelerated version:

import numba as nb


@nb.njit
def hex_to_ascii(x):
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + (ascii_num_offset if x < 10 else ascii_alp_offset)


@nb.njit
def _to_hex_2d(x):
    a, b = divmod(x, 16)
    return hex_to_ascii(a), hex_to_ascii(b)


@nb.njit
def _as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        for j in range(m):
            rgb[j, 0:2] = _to_hex_2d(r_arr[i, j])
            rgb[j, 2:4] = _to_hex_2d(g_arr[i, j])
            rgb[j, 4:6] = _to_hex_2d(b_arr[i, j])
        yield rgb


def as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m).tolist()

This essentially involves manually writing the numbers, correctly converted to hexadecimal ASCII codes, into a properly typed array, which can then be viewed as fixed-width strings to give the desired output.
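
As a minimal sketch of the view trick in isolation (the array below is pre-filled with the ASCII codes of '1f6984'; it is not part of the code above):

    import numpy as np

    codes = np.array([[0x31, 0x66, 0x36, 0x39, 0x38, 0x34]], dtype=np.uint32)
    print(codes.view('<U6').reshape(1).tolist())
    # ['1f6984']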

Note that the final numpy.ndarray.tolist() can be avoided if whatever consumes the generator is capable of dealing with the NumPy array itself, thus saving an appreciable amount of time, e.g.:

def as_str_nba(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m)

Overcoming the IO-bound bottleneck

However, if you are IO-bound, you should modify your code to write in blocks, e.g. using the grouper recipe from the itertools recipes:

from itertools import zip_longest


def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)  # strict= requires Python 3.10+
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')

to be used like:

group_size = 3
for x in grouper(as_str_OP(r_arr, g_arr, b_arr), group_size):
    write_many_to_disk(x)
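
write_many_to_disk is left undefined here; for illustration, it could dump each block to disk in a single call (a sketch; the pickle format and the file naming are assumptions, not part of the original code):

    import pickle

    for idx, block in enumerate(grouper(as_str_OP(r_arr, g_arr, b_arr), group_size)):
        # drop the padding Nones that grouper's default 'fill' mode adds to the last block
        rows = [row for row in block if row is not None]
        with open('block_%d.pkl' % idx, 'wb') as f:
            pickle.dump(rows, f)  # one file write per block instead of one per row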

Testing out the output

Some dummy input can be produced easily (r_arr is essentially red_list, etc.):

def gen_color(n, m):
    return np.random.randint(0, 2 ** 8, (n, m))


N, M = 10, 3
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)

and tested by consuming the generator to produce a list:

res_OP = list(as_str_OP(r_arr, g_arr, b_arr))
res_np = list(as_str_np(r_arr, g_arr, b_arr))
res_nb = list(as_str_nb(r_arr, g_arr, b_arr))
res_nba = list(as_str_nba(r_arr, g_arr, b_arr))
print(np.array(res_OP))
# [['1f6984' '916d98' 'f9d779']
#  ['65f895' 'ded23e' '332fdc']
#  ['b9e059' 'ce8676' 'cb75e9']
#  ['bca0fc' '3289a9' 'cc3d3a']
#  ['6bb0be' '07134a' 'c3cf05']
#  ['152d5c' 'bac081' 'c59a08']
#  ['97efcc' '4c31c0' '957693']
#  ['15247e' 'af8f0a' 'ffb89a']
#  ['161333' '8f41ce' '187b01']
#  ['d811ae' '730b17' 'd2e269']]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [x.tolist() for x in res_nba])
# True

and, optionally, passing the output through some grouping:

k = 3
res_OP = list(grouper(as_str_OP(r_arr, g_arr, b_arr), k))
res_np = list(grouper(as_str_np(r_arr, g_arr, b_arr), k))
res_nb = list(grouper(as_str_nb(r_arr, g_arr, b_arr), k))
res_nba = list(grouper(as_str_nba(r_arr, g_arr, b_arr), k))
print(np.array(res_OP))
# [[list(['1f6984', '916d98', 'f9d779'])
#   list(['65f895', 'ded23e', '332fdc'])
#   list(['b9e059', 'ce8676', 'cb75e9'])]
#  [list(['bca0fc', '3289a9', 'cc3d3a'])
#   list(['6bb0be', '07134a', 'c3cf05'])
#   list(['152d5c', 'bac081', 'c59a08'])]
#  [list(['97efcc', '4c31c0', '957693'])
#   list(['15247e', 'af8f0a', 'ffb89a'])
#   list(['161333', '8f41ce', '187b01'])]
#  [list(['d811ae', '730b17', 'd2e269']) None None]]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [tuple(y.tolist() if y is not None else y for y in x) for x in res_nba])
# True

Benchmarks

To give you some idea of the numbers we could be talking about, let us use %timeit on much larger inputs:

N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)


%timeit -n 1 -r 1 list(as_str_OP(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 1.1 s per loop
%timeit -n 4 -r 4 list(as_str_np(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 279 ms per loop
%timeit -n 4 -r 4 list(as_str_nb(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 96.5 ms per loop
%timeit -n 4 -r 4 list(as_str_nba(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 10.4 ms per loop

To simulate disk writing we could use the following consumer:

import time
import math


def consumer(gen, timeout_sec=0.001, weight=1):
    result = []
    for x in gen:
        result.append(x)
        time.sleep(timeout_sec * weight)
    return result

where disk writing is simulated with a time.sleep() call whose duration grows with the logarithm of the number of objects written at once:

N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)


%timeit -n 1 -r 1 consumer(as_str_OP(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 2.37 s per loop
%timeit -n 1 -r 1 consumer(as_str_np(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.48 s per loop
%timeit -n 1 -r 1 consumer(as_str_nb(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.27 s per loop
%timeit -n 1 -r 1 consumer(as_str_nba(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.13 s per loop

k = 100
%timeit -n 1 -r 1 consumer(grouper(as_str_OP(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 1.17 s per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_np(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 368 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nb(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 173 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nba(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 87.4 ms per loop

Ignoring the disk-writing simulation, the NumPy-vectorized approach is ~4x faster at the test input sizes, while the Numba-accelerated approach gets ~10x to ~100x faster, depending on whether the potentially unnecessary conversion to list() with numpy.ndarray.tolist() is present or not.

When it comes to the simulated disk writing, the faster versions are all more or less equivalent, and noticeably less effective without grouping, resulting in a ~2x speed-up. Grouping alone gets the speed-up to ~2x, but combined with the faster approaches, the speed-ups range from ~3x for the NumPy-vectorized version to ~7x or ~13x for the Numba-accelerated approaches (with or without numpy.ndarray.tolist()).

Again, this is with the given input and under the test conditions; your actual mileage may vary.



You could use reduce for the inner loop, making it possible for your computer to divide the computations between different threads behind the scenes:

from functools import reduce

for i in range(len(green_list)):
    rgb_list = reduce(
        lambda ls, j: ls + ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])],
        range(len(green_list[i])),
        [],
    )
    print(rgb_list)

Or you could try to achieve the same goal with a one-liner:

for i in range(len(green_list)):
    rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
    print(rgb_list)

Hope it will do the trick for you.

3 Comments

This answer seems well intentioned, but a comprehension isn't any more multi-threaded than a nested for-loop in CPython.
Nor is, for that matter, functools.reduce.
Hey, thank you for your input :). I would love to hear your opinion about my updated answer (solving the problem using reduce).

In the code you show, the slow bit is the string formatting. That we can improve somewhat.

A hex colour consists of eight bits for the red field, eight for the green, and eight for the blue (since your data does not seem to have an alpha channel, I am going to ignore that option). So we need at least twenty-four bits to store the rgb colours.

You can create hex values using numpy's bitwise operators. The advantage is that this is completely vectorised. You then only have one value to format into a hex string for each (i, j), instead of three:

for i in range(len(green_list)):
    hx = red_list[i] << 16 | green_list[i] << 8 | blue_list[i]
    hex_list = ['%06x' % val for val in hx]

When the numpy arrays have dimensions (10, 1_000_000), this is about 5.5x faster than your original method (on my machine).
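
For instance, on the sample arrays from the question, this reproduces the expected strings:

    import numpy as np

    red_list = np.array([[1, 2, 3, 4, 5]])
    green_list = np.array([[6, 7, 8, 9, 10]])
    blue_list = np.array([[11, 12, 13, 14, 15]])

    hx = red_list[0] << 16 | green_list[0] << 8 | blue_list[0]
    print(['%06x' % val for val in hx])
    # ['01060b', '02070c', '03080d', '04090e', '050a0f']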

3 Comments

Using this approach with pure numpy np.char.mod('%06x', red_list << 16 | green_list << 8 | blue_list) is ~2.5x slower even before iteratively writing to a file.
@MichaelSzczesny np.char.mod calls an internal numpy helper _vec_string for formatting, which does something that is fairly slow. I would keep the formatting out of numpy for that reason.
Thought future readers would give it a shot anyway and a benchmark might save them some time. I'm always surprised when numpy functions are slower than simple python loops.
1. for-loop

Modifying how rgb_list is built (list comprehension, preallocation, or append()) does not affect the performance much.

n = 1000000
red_list   = [list(range(1, n+0)), list(range(1, n+2))]
green_list = [list(range(2, n+1)), list(range(2, n+3))]
blue_list  = [list(range(3, n+2)), list(range(3, n+4))]
def test_1():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]

def test_2():
    for i in range(len(green_list)):
        rgb_list = [None] * len(green_list[i])

        for j in range(len(green_list[i])):
            rgb_list[j] = '%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])

def test_3():
    for i in range(len(green_list)):
        rgb_list = []

        for j in range(len(green_list[i])):
            rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))

%timeit -n 1 -r 7 test_1(): 1.31 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit -n 1 -r 7 test_2(): 1.33 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit -n 1 -r 7 test_3(): 1.39 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. disk IO

Modifying the disk IO pattern also does not affect the performance much.

import pickle

n = 20000000  # the three input lists are rebuilt as above with this larger n
def test_write_each():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]

        with open("test_%d" % i, "wb") as f:
            pickle.dump(rgb_list, f)

def test_write_once():
    rgb_list_list = [None] * len(green_list)

    for i in range(len(green_list)):
        rgb_list_list[i] = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]

    with open("test_all", "wb") as f:
        pickle.dump(rgb_list_list, f)

%timeit -n 1 -r 3 test_write_each(): 35.2 s ± 74.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

%timeit -n 1 -r 3 test_write_once(): 35.4 s ± 54.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Conclusion

From the benchmark results, there appears to be no bottleneck worth avoiding in the question's code.

If the disk IO itself is the problem, I would suggest running the disk-IO code only once, after every other job (including ones not mentioned in this question) has finished.

