
Is there a way to vectorize an operation that takes several numpy arrays and puts them into a list of dictionaries?

Here's a simplified example. The real scenario might involve more arrays and more dictionary keys.

import numpy as np
x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

print [dict(x=x[ii], y=y[ii], z=z[ii]) for ii in xrange(10)]

I might have thousands or hundreds of thousands of iterations in that xrange call. All the manipulation to create x, y, and z is vectorized (my real code is not as simple as the above), so there's only one for loop left to get rid of, which I expect would result in huge speedups.

I've tried using map with a function to create the dict, and all sorts of other workarounds. It seems the Python for loop is the slow part (as usual). I'm sort of stuck using dictionaries because of a pre-existing API requirement. However, solutions that avoid dicts in favor of record arrays or something similar would be interesting to see, though ultimately I don't think they will work with the existing API.
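For illustration, the map-based attempt mentioned above might look roughly like the sketch below, using the x, y, and z defined earlier (make_row is a hypothetical helper); it still builds one dict per element, so the per-row work stays in Python.

def make_row(x_, y_, z_):
    # hypothetical helper: build one dict per element
    return dict(x=x_, y=y_, z=z_)

rows = list(map(make_row, x, y, z))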

  • z=z[ii], good catch! Commented Nov 3, 2016 at 9:49
  • [dict(x=x_, y=y_, z=z_) for x_, y_, z_ in zip(x, y, z)] is as vectorised as pure Python gets. Commented Nov 3, 2016 at 9:50
  • Did you try a list and dict comprehension? Is it too slow? Commented Nov 3, 2016 at 9:50
  • I mean that the problem is maybe not the presence or absence of a loop but the construction of the list, which causes repeated memory allocations as it grows. Commented Nov 3, 2016 at 10:15
  • @durden2.0 FYI, the for in list comprehensions has little to do with Python's general for. The former has a lower-level implementation and is faster than the latter. Commented Nov 3, 2016 at 11:29

3 Answers


With your small example, I'm having trouble getting anything faster than the combination of list and dictionary comprehensions:

In [105]: timeit [{'x':i, 'y':j, 'z':k} for i,j,k in zip(x,y,z)]
100000 loops, best of 3: 15.5 µs per loop
In [106]: timeit [{'key':{'x':i, 'y':j, 'z':k}} for i,j,k in zip(x,y,z)]
10000 loops, best of 3: 37.3 µs per loop
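For a comparison that can be run outside IPython, here is a minimal sketch with the standard timeit module, using the small arrays from the question (absolute numbers will differ by machine):

import timeit
import numpy as np

x = np.arange(10)
y = np.arange(10, 20)
z = np.arange(100, 110)

# zip-based comprehension vs. indexed access, 100000 runs each
t_zip = timeit.timeit(lambda: [{'x': i, 'y': j, 'z': k} for i, j, k in zip(x, y, z)], number=100000)
t_idx = timeit.timeit(lambda: [{'x': x[i], 'y': y[i], 'z': z[i]} for i in range(10)], number=100000)
print(t_zip, t_idx)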

The alternatives that use array concatenation to join the arrays before partitioning are slower.

In [108]: timeit [{'x':x_, 'y':y_, 'z':z_} for x_, y_, z_ in np.column_stack((x,y,z))]
10000 loops, best of 3: 58.2 µs per loop

=======================

A structured array is easiest with recfunctions:

In [109]: from numpy.lib import recfunctions
In [112]: M=recfunctions.merge_arrays((x,y,z))
In [113]: M.dtype.names=['x','y','z']
In [114]: M
Out[114]: 
array([(0, 10, 100), (1, 11, 101), (2, 12, 102), (3, 13, 103),
       (4, 14, 104), (5, 15, 105), (6, 16, 106), (7, 17, 107),
       (8, 18, 108), (9, 19, 109)], 
      dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
In [115]: M['x']
Out[115]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Timed, this is much slower to build, but if you want to access all the x values at once, it's much better than fetching them from all the dictionaries.
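To illustrate that access difference, a sketch (dlist stands in for the list of dicts built with the comprehension earlier):

dlist = [{'x': i, 'y': j, 'z': k} for i, j, k in zip(x, y, z)]

xs_struct = M['x']                            # one vectorized field lookup
xs_dicts = np.array([d['x'] for d in dlist])  # rebuilds the array element by element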

np.rec.fromarrays((x,y,z),names=['x','y','z'])

produces a recarray with given names. About the same speed.
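For example, the recarray also allows attribute-style access on top of the usual field indexing (a quick sketch):

R = np.rec.fromarrays((x, y, z), names=['x', 'y', 'z'])
R.x       # array([0, 1, 2, ..., 9]), same data as R['x']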

I could also construct an empty array of the right dtype and shape and copy the arrays to it. That's probably as fast as this merge but more complicated to describe.

I'd suggest optimizing the data structure for use/access rather than construction speed. Generally you construct it once, and use it many times.

============

In [125]: dt=np.dtype([('x',x.dtype),('y',y.dtype),('z',z.dtype)])
In [126]: xyz=np.zeros(x.shape,dtype=dt)
In [127]: xyz['x']=x; xyz['y']=y; xyz['z']=z
# or for n,d in zip(xyz.dtype.names, (x,y,z)): xyz[n] = d
In [128]: xyz
Out[128]: 
array([(0, 10, 100), (1, 11, 101), (2, 12, 102), (3, 13, 103),
       (4, 14, 104), (5, 15, 105), (6, 16, 106), (7, 17, 107),
       (8, 18, 108), (9, 19, 109)], 
      dtype=[('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
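If the pre-existing API really does require plain dictionaries in the end, the structured array can be unpacked as a last step; a minimal sketch (note this final conversion is itself a Python-level loop, and the values stay numpy scalars):

dicts = [dict(zip(xyz.dtype.names, row)) for row in xyz]
# -> [{'x': 0, 'y': 10, 'z': 100}, {'x': 1, 'y': 11, 'z': 101}, ...]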

2 Comments

Using a structured array is the more proper way to go, but there is no need to use recfunctions.
I added the non-recfunctions version.

Here is one (Num)?Pythonic way:

In [18]: names = np.array(['x', 'y', 'z'])
In [38]: map(dict, np.dstack((np.repeat(names[None, :], 10, axis=0), np.column_stack((x, y, z)))))
Out[38]: 
[{'x': '0', 'y': '10', 'z': '100'},
 {'x': '1', 'y': '11', 'z': '101'},
 {'x': '2', 'y': '12', 'z': '102'},
 {'x': '3', 'y': '13', 'z': '103'},
 {'x': '4', 'y': '14', 'z': '104'},
 {'x': '5', 'y': '15', 'z': '105'},
 {'x': '6', 'y': '16', 'z': '106'},
 {'x': '7', 'y': '17', 'z': '107'},
 {'x': '8', 'y': '18', 'z': '108'},
 {'x': '9', 'y': '19', 'z': '109'}]
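Note that the values in these dicts are strings ('0', '10', ...) rather than integers, because stacking the string names together with the numeric data forces a common string dtype. If the API needs numbers, a hedged post-processing sketch could cast them back:

str_dicts = map(dict, np.dstack((np.repeat(names[None, :], 10, axis=0),
                                 np.column_stack((x, y, z)))))
# cast the string values back to ints (assumes all-integer data, as here)
int_dicts = [{k: int(v) for k, v in d.items()} for d in str_dicts]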

Also, note that if you don't need all of the dictionaries at once, you can simply create a generator and access each item on demand:

(dict(x=x[ii], y=y[ii], z=z[ii]) for ii in xrange(10))

If you want a nested dictionary, I suggest a list comprehension:

In [88]: inner = np.dstack((np.repeat(names[None, :], 10, axis=0), np.column_stack((x, y))))

In [89]: [{'connection': d} for d in map(dict, inner)]
Out[89]: 
[{'connection': {'x': '0', 'y': '10'}},
 {'connection': {'x': '1', 'y': '11'}},
 {'connection': {'x': '2', 'y': '12'}},
 {'connection': {'x': '3', 'y': '13'}},
 {'connection': {'x': '4', 'y': '14'}},
 {'connection': {'x': '5', 'y': '15'}},
 {'connection': {'x': '6', 'y': '16'}},
 {'connection': {'x': '7', 'y': '17'}},
 {'connection': {'x': '8', 'y': '18'}},
 {'connection': {'x': '9', 'y': '19'}}]
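If z (or more arrays) also has to go inside the nested dictionary, the zip-based pattern from the first answer extends directly and keeps the values numeric; a sketch:

nested = [{'connection': {'x': i, 'y': j, 'z': k}} for i, j, k in zip(x, y, z)]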

10 Comments

Nice! What about if I need to create nested dictionaries. For example, {'connection': {'x': x[ii], 'y': y[ii]}}. I realized I simplified my question a bit too much for my scenario.
@durden2.0 You mean, you want a nested dictionary for all the items? and without z? or it's optional?
Yes nested dictionary for all the items, z is required in my solution but doesn't really matter too much for the solution itself because the key is a nested dictionary without doing a python for loop.
@durden2.0 List comprehension's for is not like Python's for loop, because its iteration has been implemented in C. But if you want to do this in Numpy after all, since you want a Python object, your code needs to interact with the upper (Python) level, and this means you can't write pure numpythonic code. Commented Nov 3, 2016
@Kasramvd "List comprehension's for is not like Python's for loop, cause its iteration has been implemented in C", really? I wonder why they are slow then? One more relevant link: stackoverflow.com/questions/22108488/…

Here's an approach using a mix of NumPy and Pandas -

import numpy as np
import pandas as pd

# Stack into columns & create a pandas dataframe with appropriate col names
a = np.column_stack((x.ravel(), y.ravel(), z.ravel()))
df = pd.DataFrame(a, columns=['x', 'y', 'z'])

# Convert to list of dicts
out = df.T.to_dict().values()

Sample run -

In [52]: x
Out[52]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]: y
Out[53]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [54]: z
Out[54]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])

In [55]: out
Out[55]: 
[{'x': 0, 'y': 10, 'z': 100},
 {'x': 1, 'y': 11, 'z': 101},
 {'x': 2, 'y': 12, 'z': 102},
 {'x': 3, 'y': 13, 'z': 103},
 {'x': 4, 'y': 14, 'z': 104},
 {'x': 5, 'y': 15, 'z': 105},
 {'x': 6, 'y': 16, 'z': 106},
 {'x': 7, 'y': 17, 'z': 107},
 {'x': 8, 'y': 18, 'z': 108},
 {'x': 9, 'y': 19, 'z': 109}]
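As a side note, a more direct pandas route (a sketch, assuming the same DataFrame df) is the 'records' orientation of to_dict, which skips the transpose; wrapping each record also gives the nested form asked about in the comments:

out = df.to_dict('records')
# [{'x': 0, 'y': 10, 'z': 100}, {'x': 1, 'y': 11, 'z': 101}, ...]

nested = [{'connection': rec} for rec in out]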

2 Comments

Clever solution! I simplified my setup, but in reality I'm creating a nested dictionary, so I'll try to tweak this a bit. For example, the list of dicts I'm returning is actually more like [{'connection': {'xy': x[ii], 'yy': y[ii]}}].
@durden2.0 Could you edit the loop comprehension code in the question for that requirement?
