
I would like to create a NumPy array with a somewhat repetitive structure: a particular function (here, as an example, shuffle()) takes two numbers and returns an array (here of length 8, though it could be longer). These arrays are then concatenated.

import numpy


def shuffle(a, b):
    return numpy.array([
        [+a, +b], [-a, +b], [+a, -b], [-a, -b],
        [+b, +a], [-b, +a], [+b, -a], [-b, -a],
        ])


pairs = [
    (0.1, 0.2),
    (3.14, 2.71), 
    # ... many, without a particular pattern ...
    (0.707, 0.577)
    ]
out = numpy.concatenate([shuffle(*pair) for pair in pairs])

I suppose what happens here is that all subarrays of length 8 are independently created in memory, just to be copied over right away to form the larger array out. This gets needlessly inefficient when there are lots of pairs (a, b) or when shuffle is replaced by something that returns more data.

One way around this would be to hardcode out à la

out = numpy.array([
    [+0.1, +0.2],
    [-0.1, +0.2],
    # ...
    [-0.2, -0.1],
    [+3.14, +2.71],
    # ...
    ])

but that's obviously not desirable either.

In C, I'd perhaps use a macro parsed by the preprocessor.

Any hints on how to arrange the above code to avoid unnecessary copies?

  • You can probably do it much more efficiently using matrix operations, I would expect. Commented Aug 3, 2017 at 16:14
  • Sounds like something built for itertools.permutations Commented Aug 3, 2017 at 16:16
  • If you allocate an empty array np.empty(dims) then fill it block-by-block, that would avoid it. Commented Aug 3, 2017 at 16:57
  • After you create a bunch of out arrays, are you going to add them up, or do something else? Maybe a better solution is to not make these in the first place. Commented Aug 4, 2017 at 7:37

4 Answers


This:

   [
    [+a, +b], [-a, +b], [+a, -b], [-a, -b],
    [+b, +a], [-b, +a], [+b, -a], [-b, -a],
    ]

is a list of lists. Hard coding the numbers makes little difference.

np.array(...) then converts the list to an array.

np.fromiter tends to be faster, but only works with 1d data, thus requiring a reshape.

Is this step really that big of a time consumer?

Some timing explorations:

In [245]: timeit shuffle(1,2)
9.29 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
...
In [248]: out=np.concatenate([shuffle(1,2) for _ in range(100)])
In [249]: out.shape
Out[249]: (800, 2)
In [250]: timeit out=np.concatenate([shuffle(1,2) for _ in range(100)])
1.02 ms ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Generating an array of the same size, but with a simpler stacking step. This might be close to the optimal speed, if only it generated the right numbers:

In [251]: np.stack([np.arange(800),np.arange(800)],1).shape
Out[251]: (800, 2)
In [252]: timeit np.stack([np.arange(800),np.arange(800)],1).shape
21.4 µs ± 902 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We could explore alternatives, but at some level you want to give priority to clarity. What's the clearest way of generating the desired array?

Let's try it without the intermediate np.array call in shuffle:

def shuffle1(a, b):
    return [
        [+a, +b], [-a, +b], [+a, -b], [-a, -b],
        [+b, +a], [-b, +a], [+b, -a], [-b, -a],
        ]

In [259]: timeit np.array([shuffle1(1,2) for _ in range(100)]).reshape(-1,2)
765 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

1 ms vs. 0.75 ms: a modest speed improvement.

Using fromiter instead of np.array in shuffle cuts time in half:

import numpy as np

def shuffle2(a, b):
    return np.fromiter(
        [+a, +b, -a, +b, +a, -b, -a, -b,
         +b, +a, -b, +a, +b, -a, -b, -a,
         ], int).reshape(-1, 2)

In [279]: timeit out=np.concatenate([shuffle2(1,2) for _ in range(100)])
503 µs ± 4.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
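Note that np.fromiter needs an explicit dtype; the int above matches the test call shuffle2(1, 2), but for the question's float pairs you would pass float instead (a hypothetical variant, not part of the original answer):

def shuffle2f(a, b):
    # same layout as shuffle2, but with a float dtype for the question's data
    return np.fromiter(
        [+a, +b, -a, +b, +a, -b, -a, -b,
         +b, +a, -b, +a, +b, -a, -b, -a,
         ], float).reshape(-1, 2)

out = np.concatenate([shuffle2f(*pair) for pair in pairs])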

Comments

Ah, that's right: in numpy.array([a, b, c]), the first thing that happens is that the list [a, b, c] is created. Perhaps instead of returning a numpy.array from shuffle(), I could return a list, so that I only have to call np.array once later on.
Ha! I just checked and found that, much to my surprise, np.concatenate is actually slower on a list of lists than on a list of np.arrays, even if the lists are converted to np.arrays first.
Using np.array(...) on the bigger nested list of lists (of lists) gives a modest speed improvement. See my edits.
What do you get if instead of array and reshape, you do a concatenate?
Concatenate is about the same as the original. When given a list of lists, concatenate has to first turn each sublist into an array, and then do the array concatenate. So the number of calls to np.array is the same.

Here's a method that uses fancy indexing.

pairs is your sample input, stored in a numpy array:

In [7]: pairs
Out[7]: 
array([[ 0.1  ,  0.2  ],
       [ 3.14 ,  2.71 ],
       [ 0.707,  0.577]])

pairspm is an array whose rows are [a, b, -a, -b].

In [8]: pairspm = np.hstack((pairs, -pairs))

The values in indices are the indices into an array of the form [a, b, -a, -b] corresponding to the 8x2 pattern in shuffle(a, b):

In [9]: indices = np.array([[0, 1], [2, 1], [0, 3], [2, 3], [1, 0], [3, 0], [1, 2], [3, 2]])

out is now just fancy indexing of pairspm, followed by a reshape to collapse the first two dimensions of pairspm[:, indices] into one:

In [10]: out = pairspm[:, indices].reshape(-1, 2)

In [11]: out
Out[11]: 
array([[ 0.1  ,  0.2  ],
       [-0.1  ,  0.2  ],
       [ 0.1  , -0.2  ],
       [-0.1  , -0.2  ],
       [ 0.2  ,  0.1  ],
       [-0.2  ,  0.1  ],
       [ 0.2  , -0.1  ],
       [-0.2  , -0.1  ],
       [ 3.14 ,  2.71 ],
       [-3.14 ,  2.71 ],
       [ 3.14 , -2.71 ],
       [-3.14 , -2.71 ],
       [ 2.71 ,  3.14 ],
       [-2.71 ,  3.14 ],
       [ 2.71 , -3.14 ],
       [-2.71 , -3.14 ],
       [ 0.707,  0.577],
       [-0.707,  0.577],
       [ 0.707, -0.577],
       [-0.707, -0.577],
       [ 0.577,  0.707],
       [-0.577,  0.707],
       [ 0.577, -0.707],
       [-0.577, -0.707]])

(With a little more work, you could eliminate the need for pairspm.)
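One way to do that (a sketch, not from the original answer) is to index the columns of pairs directly and apply the signs separately, so the intermediate pairspm array is no longer needed:

# hypothetical sketch: pick columns of `pairs` with an index array, then apply signs
cols = np.array([[0, 1], [0, 1], [0, 1], [0, 1],
                 [1, 0], [1, 0], [1, 0], [1, 0]])
signs = np.array([[+1, +1], [-1, +1], [+1, -1], [-1, -1],
                  [+1, +1], [-1, +1], [+1, -1], [-1, -1]])
out = (pairs[:, cols] * signs).reshape(-1, 2)   # shape (8 * len(pairs), 2)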



If you know the dimensions beforehand, you can allocate an empty array and then fill it up. Since the length of pairs determines the final array size from the start, we can write into a flat view of the array in blocks of 16, one block per pair.

import numpy as np


def gen(pairs):
    out = np.empty((8 * len(pairs), 2), dtype=float)
    for n, (a, b) in enumerate(pairs):
        out.flat[16*n:16*(n+1)] = [
            +a, +b, -a, +b, +a, -b, -a, -b,
            +b, +a, -b, +a, +b, -a, -b, -a,
        ]
    return out
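A short usage sketch (using the pairs from the question):

pairs = [(0.1, 0.2), (3.14, 2.71), (0.707, 0.577)]
out = gen(pairs)
print(out.shape)  # (24, 2): 8 rows per pair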



Here is another approach that builds the sign-flipped rows for all pairs with a single broadcast multiplication, without stacking individual arrays (the swapped (b, a) rows can be added the same way; see the sketch after the timing):

import numpy as np
# generate some data:
pairs = np.random.randint(1, 100, (1000, 2))
# create "sign" array covering the four +/- combinations:
u = np.array([[[1, 1], [-1, 1], [1, -1], [-1, -1]]])
# create the output array (four sign-flipped rows per pair):
out = (pairs[:, None, :] * u).reshape((-1, 2))

Timing:

%timeit (pairs[:, None, :] * u).reshape((-1, 2))
10000 loops, best of 3: 49 µs per loop
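To reproduce the full 8-row pattern from shuffle(), including the swapped (b, a) rows, one extra broadcast over the reversed columns would do it (a sketch, not part of the original answer or the timing above):

# hypothetical extension: apply the same signs to the reversed columns as well,
# then concatenate the two 4-row blocks per pair to match shuffle()'s row order
signed = pairs[:, None, :] * u           # (+a, +b), (-a, +b), (+a, -b), (-a, -b)
swapped = pairs[:, None, ::-1] * u       # (+b, +a), (-b, +a), (+b, -a), (-b, -a)
out = np.concatenate([signed, swapped], axis=1).reshape(-1, 2)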

