Construct a numpy array with dtypes column wise

Question

I am stuck on this. I would like to pre-allocate a huge 2D NumPy array with shape (10000000, 3) and one specific dtype per column.

Example:

    a         b        c     
 -------- --------- -------- 
  uint32   float32   uint8   
  ------   ------    ------  
  90       2.43      4       
  100      2.42      2       
  123      2.33      1

So from the docs I can create a 2d array like this:

arr = np.zeros((4,3))                                                                                                                                                                                          
arr                                                                                                                                                                                                            
Out[6]: 
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

Good so far, but what about dtypes?

In [16]: arr.dtype                                                                                                                                                                                                     
Out[16]: dtype('float64')

All float. So let's define a dtype:

dtype_L1 = np.dtype({'names': ['a', 'b', 'c'], 
               'formats': [np.uint32, np.float32, np.uint8]})

And compare both:

In [25]: arr_dtype = np.zeros((4,3), dtype=dtype_L1)                                                                                                                                                                   

In [26]: arr = np.zeros((4,3))                                                                                                                                                                                         

In [27]: arr[0,0]                                                                                                                                                                                                      
Out[27]: 0.0

In [28]: arr_dtype[0,0]                                                                                                                                                                                                
Out[28]: (0, 0., 0)

In [29]: type(arr_dtype[0,0])                                                                                                                                                                                          
Out[29]: numpy.void

In [30]: type(arr[0,0])                                                                                                                                                                                                
Out[30]: numpy.float64

In [31]: arr.shape                                                                                                                                                                                                     
Out[31]: (4, 3)

In [32]: arr_dtype.shape                                                                                                                                                                                               
Out[32]: (4, 3)

So I do not see why arr_dtype is not the same as arr, just with a different dtype per column. Can somebody point me in the right direction, please? It looks like I am creating an array with one dimension too many:

Update: One dimension too deep..?

>>> arr[0,0]
0.0  # correct

>>> arr_dtype[0,0]
(0, 0., 0)

It really holds the dtyped array here?! Looking one dimension deeper:

>>> type(arr_dtype[0,0][0])
<class 'numpy.uint32'>
>>> type(arr_dtype[0,0][1])
<class 'numpy.float32'>
>>> type(arr_dtype[0,0][2])
<class 'numpy.uint8'>
# all good - But one level too deep.

Expected: NumPy builds a 4x3 matrix where each element is a single number, so 12 numbers in total.
Observed: NumPy builds a 4x3 matrix where each element is a structure of shape (3,), so I have 4x3x3 = 36 numbers.

So is it possible to apply dtype in another way?

Final solution

You basically need to decide what is more important: saving space or having all data in one array. A plain array can only hold one dtype. So if you need different data types, go for multiple arrays of the same length along the Y-axis. Otherwise, simply create one array like arr_dtype = np.zeros((4,3), dtype=np.float32), and make sure to set the dtype that fits each array. Thanks for the comments!
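The trade-off described above can be made concrete by comparing memory footprints. This is only a sketch: `n` is a small stand-in for the 10,000,000 rows in the question, and the variable names are illustrative.

```python
import numpy as np

n = 1_000  # assumed small row count; the question uses 10,000,000

# Option 1: three separate 1-D arrays, each with its own dtype
a = np.zeros(n, dtype=np.uint32)
b = np.zeros(n, dtype=np.float32)
c = np.zeros(n, dtype=np.uint8)
separate_bytes = a.nbytes + b.nbytes + c.nbytes  # 4 + 4 + 1 bytes per row

# Option 2: one homogeneous 2-D array, every column stored as float32
arr_f32 = np.zeros((n, 3), dtype=np.float32)     # 3 * 4 bytes per row

print(separate_bytes)   # 9000
print(arr_f32.nbytes)   # 12000
```

Note that storing the uint32 column as float32 is not lossless: float32 represents integers exactly only up to 2**24, so larger values would be rounded.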

arr_dtype and arr have different shape and dtype. The fields of one aren't the same as the columns of the other. Only the compound dtype allows a mix of dtypes.
– hpaulj, Commented May 9, 2020 at 0:27
Sorry @hpaulj, but your comment did not help me move forward. I would like to have a simple array: 3 columns, 4 rows. The zeroth column has type uint32, the first column float32 and the second one uint8. I think it would be clearer if I could see how to do that.
– gies0r, Commented May 9, 2020 at 21:26
You cannot have a "simple" array with different dtypes in each column.
– hpaulj, Commented May 9, 2020 at 23:41
@hpaulj Ok... So how can I then achieve a structured array with three different column-wise dtypes? I am still trying to figure out why the code is wrong (based on your comment). From one of many examples it looks like the dtype property as applied here is correct? Happy for advice.
– gies0r, Commented May 10, 2020 at 18:37
dt = np.dtype(...); arr = np.zeros((2000,), dtype=dt) makes the structured array. arr=np.zeros((2000,3), dtype=float) makes the 2d float array. A structured array makes most sense when one or more of the columns are string dtype, and/or a mix of float and int. It's really just an alternative to creating 3 separate arrays, each with their own dtype. You can't do math across the fields, so there's little computational advantage to using the compound dtype.
– hpaulj, Commented May 10, 2020 at 19:19
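
To illustrate the last comment, here is a hedged sketch of what math on a structured array can look like. The sample values come from the table in the question; `rows`, `a_plus_c`, `plain` and `row_sums` are illustrative names. It uses `structured_to_unstructured` from `numpy.lib.recfunctions` to get a plain 2-D array for axis-wise math, which is one possible approach rather than the one the commenter had in mind.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype({'names': ['a', 'b', 'c'],
               'formats': [np.uint32, np.float32, np.uint8]})
rows = np.array([(90, 2.43, 4), (100, 2.42, 2), (123, 2.33, 1)], dtype=dt)

# Individual fields behave like ordinary 1-D arrays and can be combined by name
a_plus_c = rows['a'] + rows['c']
print(a_plus_c.tolist())    # [94, 102, 124]

# Axis-wise math needs a plain 2-D array; with mixed field types this
# conversion produces a copy in a common (promoted) dtype
plain = rfn.structured_to_unstructured(rows)
print(plain.shape)          # (3, 3)

row_sums = plain.sum(axis=1)  # one sum per row, across all three fields
```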

gies0r · Accepted Answer · 2020-05-11 03:05:17Z

Think of a row in your array as a single element. That's effectively what a compound dtype does for you. You can define your dtype as

d1 = np.dtype({'names': ['a', 'b', 'c'], 
               'formats': [np.uint32, np.float32, np.uint8]})

This means that you have a 3 column array. You allocate it with something like

arr = np.empty(10000, dtype=d1)

Substitute zeros for empty as you see fit. The result is effectively a (10000, 3) array, although it appears as a (10000,) array. You can extract views to individual columns using the field names, e.g.:

arr['a']
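
Putting the steps of this answer together, a minimal sketch could look like the following. It uses 4 rows instead of 10000 to stay readable, and the sample values are taken from the question's table.

```python
import numpy as np

d1 = np.dtype({'names': ['a', 'b', 'c'],
               'formats': [np.uint32, np.float32, np.uint8]})

# 4 rows; the three "columns" live inside the dtype, not in the shape
arr = np.zeros(4, dtype=d1)

arr[0] = (90, 2.43, 4)            # assign a whole row as a tuple
arr['a'][1:] = [100, 123, 7]      # assign part of a column by field name

print(arr.shape)         # (4,)
print(arr.dtype.names)   # ('a', 'b', 'c')
print(arr.itemsize)      # 9  (4 + 4 + 1 bytes per row, packed)
print(arr['a'].dtype)    # uint32
```

Each row occupies 9 bytes here because the dtype is packed by default (no alignment padding).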

edited May 11, 2020 at 3:05

gies0r

answered May 10, 2020 at 23:36

Mad Physicist

5 Comments

gies0r Over a year ago

Yes, but if I understand it correctly, this is not really a 2D array. It is three 1D arrays, and therefore computations over the different columns in the same row are not efficient (although they can be accessed using the [] getter).

Mad Physicist Over a year ago

@gies0r. You have a packed memory layout exactly as what you would expect, and as of numpy 1.16, accessing the fields returns a true view with strides properly adjusted to skip over the remaining fields. It's no more inefficient than taking the slice of a real 3D array

Mad Physicist Over a year ago

@gies0r. That being said, depending on the operation, it may be more efficient to maintain an Nx3 array of uniform type, or 3 separate contiguous arrays. Contiguity generally matters more than anything, even type conversion, in these cases.
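
The points about views, strides and contiguity can be checked directly. This is a small sketch with illustrative names (`col_b`, `b_separate`), assuming the packed dtype from the answer.

```python
import numpy as np

d1 = np.dtype({'names': ['a', 'b', 'c'],
               'formats': [np.uint32, np.float32, np.uint8]})
arr = np.zeros(5, dtype=d1)

col_b = arr['b']      # field access gives a view into arr, not a copy
col_b[0] = 2.5        # writing through the view changes arr itself
print(arr['b'][0])    # 2.5

# The view steps over the whole 9-byte record to reach the next 'b' value
print(col_b.strides)                  # (9,)
print(col_b.flags['C_CONTIGUOUS'])    # False

# A separate float32 array of the same length is tightly packed
b_separate = np.zeros(5, dtype=np.float32)
print(b_separate.strides)                  # (4,)
print(b_separate.flags['C_CONTIGUOUS'])    # True
```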

Mad Physicist Over a year ago

My goal was to answer the question of how to create a dataframe-like array, not point out the reasons not to. After all, that is what you asked.

gies0r Over a year ago

I agree with all of your points and yes, your solution works as expected. I was assuming that it should be very similar to pd.DataFrame, but I think it is worth mentioning here that there is a difference between creating one big array and creating multiple smaller arrays on the same Y-axis with regard to their usage and computation time. Nevertheless, your solution works in cases where this is not needed, so I upvoted it.

Ehsan · Accepted Answer · 2020-05-09 01:45:48Z

arr_dtype is a structured array with fields (here 'a', 'b' and 'c'), each of which can have its own data type. Each element of the structured array is a structure, and the members of that structure (corresponding to the fields) can have different data types. Technically, the elements of arr_dtype all have the same type, which is numpy.void in this case; it is the members of each numpy.void object that can have different data types. In other words, NumPy array elements are always homogeneous. arr, on the other hand, is an unstructured array in which all elements have the same data type. Each element of an unstructured array is a single object (in this case, a number).

Check your output to see the difference:

arr_dtype
[[(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]]

arr
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
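
The same difference shows up in the array attributes. The sketch below reuses the names from the question and counts elements and bytes; it is meant as an illustration of the point above, not as part of the original answer.

```python
import numpy as np

dtype_L1 = np.dtype({'names': ['a', 'b', 'c'],
                     'formats': [np.uint32, np.float32, np.uint8]})

arr_dtype = np.zeros((4, 3), dtype=dtype_L1)
arr = np.zeros((4, 3))

# Both arrays have 12 elements, but each element of arr_dtype is a
# 3-field record, so it stores 12 * 3 = 36 numbers
print(arr_dtype.size)                                # 12
print(arr_dtype.size * len(arr_dtype.dtype.names))   # 36

# Selecting one field yields a full 4x3 array of that field's dtype
print(arr_dtype['a'].shape)   # (4, 3)

print(arr_dtype.nbytes)   # 108  (12 records * 9 bytes)
print(arr.nbytes)         # 96   (12 float64 values * 8 bytes)
```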
