3

So I really give up on this... I would like to pre-allocate a huge 2D NumPy array with shape (10000000, 3) and one specific dtype per column.

Example:

    a         b        c     
 -------- --------- -------- 
  uint32   float32   uint8   
  ------   ------    ------  
  90       2.43      4       
  100      2.42      2       
  123      2.33      1   

So from the docs I can create a 2d array like this:

arr = np.zeros((4,3))                                                                                                                                                                                          
arr                                                                                                                                                                                                            
Out[6]: 
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

Good so far, but what about dtypes?

In [16]: arr.dtype                                                                                                                                                                                                     
Out[16]: dtype('float64')

All floats. So let's define a dtype:

dtype_L1 = np.dtype({'names': ['a', 'b', 'c'], 
               'formats': [np.uint32, np.float32, np.uint8]})

And compare both:

In [25]: arr_dtype = np.zeros((4,3), dtype=dtype_L1)                                                                                                                                                                   

In [26]: arr = np.zeros((4,3))                                                                                                                                                                                         

In [27]: arr[0,0]                                                                                                                                                                                                      
Out[27]: 0.0

In [28]: arr_dtype[0,0]                                                                                                                                                                                                
Out[28]: (0, 0., 0)

In [29]: type(arr_dtype[0,0])                                                                                                                                                                                          
Out[29]: numpy.void

In [30]: type(arr[0,0])                                                                                                                                                                                                
Out[30]: numpy.float64

In [31]: arr.shape                                                                                                                                                                                                     
Out[31]: (4, 3)

In [32]: arr_dtype.shape                                                                                                                                                                                               
Out[32]: (4, 3)

So I do not see why arr_dtype is not the same as arr, just with a different dtype per column. Can somebody point me in the right direction, please? It looks like I am creating an array with too many dimensions:

**Update: One dimension too deep?**

>>> arr[0,0]
0.0 ## Correct

>>> arr_dtype[0,0]
(0, 0., 0) 

It really holds the dtyped array here?! Looking one dimension deeper:

>>> type(arr_dtype[0,0][0])
<class 'numpy.uint32'>
>>> type(arr_dtype[0,0][1])
<class 'numpy.float32'>
>>> type(arr_dtype[0,0][2])
<class 'numpy.uint8'>
# all good - But one level too deep.
  • Expected: numpy sets up a 4x3 matrix where each element is a number, so 12 numbers in total.
  • Observed: numpy sets up a 4x3 matrix where each element is a structure with 3 fields, so I have 4x3x3 = 36 numbers.

So is it possible to apply dtype in another way?

Final solution

You basically need to decide what is more important: saving space or having all data in one array. A plain array can only have one dtype in it. So if you need different data types, go for multiple arrays with the same length along the Y-axis. Otherwise, simply create it like arr_dtype = np.zeros((4,3), dtype=np.float32) and make sure to set the correct dtype for each array. Thanks for the comments!
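A minimal sketch of the two options described above, assuming the column names a, b and c from the example table and a small row count in place of 10000000. The variable names n_rows, bytes_separate and bytes_single are made up for illustration.

```python
import numpy as np

n_rows = 4  # illustrative; the question uses 10000000 rows

# Option 1: one 1-D array per column, each with its own dtype.
a = np.zeros(n_rows, dtype=np.uint32)
b = np.zeros(n_rows, dtype=np.float32)
c = np.zeros(n_rows, dtype=np.uint8)

# Option 2: a single homogeneous 2-D array with one dtype for all columns.
arr = np.zeros((n_rows, 3), dtype=np.float32)

# Memory footprint of each option, in bytes.
bytes_separate = a.nbytes + b.nbytes + c.nbytes  # (4 + 4 + 1) bytes per row
bytes_single = arr.nbytes                        # 3 * 4 bytes per row

print(bytes_separate, bytes_single)
```

With these dtypes the separate arrays need 9 bytes per row and the float32 array needs 12, so the trade-off is a smaller footprint against the convenience of a single 2-D array.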

6
  • arr_dtype and arr have different shape and dtype. The fields of one aren't the same as the columns of the other. Only the compound dtype allows a mix of dtype. Commented May 9, 2020 at 0:27
  • Sorry @hpaulj, but your comment did not help me move forward. I would like to have a simple array: 3 columns, 4 rows. The zeroth column has type uint32, the first column float32 and the second one uint8. I think it would be clearer if I could see how to do that. Commented May 9, 2020 at 21:26
  • You cannot have a "simple" array with different dtypes in each column. Commented May 9, 2020 at 23:41
  • @hpaulj Ok... So how can I then achieve a structured array with three different column-wise dtypes? I am still trying to figure out why the code is wrong (based on your comment). From one of many examples it looks like the dtype property as applied here is correct? Happy for advice. Commented May 10, 2020 at 18:37
  • dt = np.dtype(...); arr = np.zeros((2000,), dtype=dt) makes the structured array. arr=np.zeros((2000,3), dtype=float) makes the 2d float array. Structured array makes most sense when one or more of the columns are string dtype, and/or a mix of float and int. It's really just an alternative to creating 3 separate arrays each with their own dtype. You can't do math across the fields, so there's little computational advantage to using the compound dtype. Commented May 10, 2020 at 19:19

2 Answers 2

3

Think of a row in your array as a single element. That's effectively what a compound dtype does for you. You can define your dtype as

d1 = np.dtype({'names': ['a', 'b', 'c'], 
               'formats': [np.uint32, np.float32, np.uint8]})

This means that you have a 3 column array. You allocate it with something like

arr = np.empty(10000, dtype=d1)

Substitute zeros for empty as you see fit. The result is effectively a (10000, 3) array, although it appears as a (10000,) array. You can extract views to individual columns using the field names, e.g.:

arr['a']
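As a rough sketch of how this looks in practice, the snippet below allocates a small structured array with the same dtype, fills it with the three rows from the question's example table (plus one padding row of zeros, which is an assumption made here), and reads back a record and a column. The names row and col_a are illustrative only.

```python
import numpy as np

d1 = np.dtype({'names': ['a', 'b', 'c'],
               'formats': [np.uint32, np.float32, np.uint8]})

# One record per row: shape is (4,), not (4, 3).
arr = np.zeros(4, dtype=d1)

# Fill each field (column) by name; the last row is just zero padding.
arr['a'] = [90, 100, 123, 0]
arr['b'] = [2.43, 2.42, 2.33, 0.0]
arr['c'] = [4, 2, 1, 0]

row = arr[1]      # a single record holding all three fields
col_a = arr['a']  # a view onto the 'a' field of every record

col_a[3] = 7      # writing through the view changes arr itself

print(arr.shape, arr.itemsize)
print(row)
print(arr['a'])
```

Because the dtype is packed by default, each record takes 4 + 4 + 1 = 9 bytes.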

5 Comments

Yes, but if I understand it correctly, this is not really a 2d array. It is three 1d arrays, and therefore computations over the different columns in the same row are not efficient (although they can be accessed using the [] getter).
@gies0r. You have a packed memory layout exactly as what you would expect, and as of numpy 1.16, accessing the fields returns a true view with strides properly adjusted to skip over the remaining fields. It's no more inefficient than taking the slice of a real 3D array
@gies0r. That being said, depending on the operation, it may be more efficient to maintain an Nx3 array of uniform type, or 3 separate contiguous arrays. Contiguity generally matters more than anything, even type conversion, in these cases.
My goal was to answer the question of how to create a dataframe-like array, not point out the reasons not to. After all, that is what you asked.
I agree with all of your points and yes, your solution works as expected. I was assuming that it should be very similar to pd.DataFrame, but I think it is worth mentioning that there is a difference between creating one big array and creating multiple smaller arrays on the same Y-axis with regard to their usage and computation time. Nevertheless, your solution works in cases where this is not needed, so I upvoted it.
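The point about views and contiguity in the comments above can be inspected directly. This is only a sketch, assuming the packed dtype from the answer and an arbitrary length of 5; the names col_b and contiguous_b are made up here.

```python
import numpy as np

d1 = np.dtype({'names': ['a', 'b', 'c'],
               'formats': [np.uint32, np.float32, np.uint8]})
arr = np.zeros(5, dtype=d1)

# Field access gives a view whose stride is the full record size (9 bytes),
# so consecutive 'b' values are not adjacent in memory.
col_b = arr['b']
print(col_b.strides, col_b.flags['C_CONTIGUOUS'])
print(np.shares_memory(col_b, arr))

# If contiguity matters for a hot loop, copy the column out once.
contiguous_b = np.ascontiguousarray(col_b)
print(contiguous_b.strides)
```

Whether the strided view or the contiguous copy is faster depends on the operation, so it is worth measuring on the real workload.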
0

arr_dtype is a structured array with fields (here 'a', 'b' and 'c'), each of which can have its own data type. Each element of the structured array is a structure, and the elements of that structure (corresponding to the fields) can have different data types. Technically, the elements of arr_dtype all have the same type, which is numpy.void in this case; it is the fields inside each numpy.void object that can have different data types. In other words, numpy array elements are always homogeneous. arr, on the other hand, is an unstructured array whose elements all have the same data type. Each element of an unstructured array is a single object (in this case a number).

Check your output to see the difference:

arr_dtype
[[(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]
 [(0, 0., 0) (0, 0., 0) (0, 0., 0)]]

arr
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
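To make the difference in element counts concrete, here is a small sketch that compares the (4,3) structured array from the question with a (4,) structured array using the same dtype. It also converts the latter to a plain 2-D array with numpy.lib.recfunctions.structured_to_unstructured, which assumes NumPy 1.16 or newer. The names arr_2d, arr_1d and plain are illustrative.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dtype_L1 = np.dtype({'names': ['a', 'b', 'c'],
                     'formats': [np.uint32, np.float32, np.uint8]})

arr_2d = np.zeros((4, 3), dtype=dtype_L1)  # 12 records with 3 fields each
arr_1d = np.zeros(4, dtype=dtype_L1)       # 4 records with 3 fields each

n_fields = len(dtype_L1.names)
n_values_2d = arr_2d.size * n_fields  # 36 numbers, as observed in the question
n_values_1d = arr_1d.size * n_fields  # 12 numbers, as the question expected

# Convert the records to an ordinary 2-D array with one common dtype.
plain = rfn.structured_to_unstructured(arr_1d)

print(n_values_2d, n_values_1d)
print(plain.shape, plain.dtype)
```

The converted array has one row per record and one column per field, at the cost of promoting all columns to a single common dtype.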
