How to create numpy.ndarray from tuple iteration

Question

I have the following loop

# `results` is obtained from a MySQLdb query.

for row in results:
    print(row)

Which prints the tuples like this:

('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0)
('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107)
('1AQ3', 'RBP', 0.0444479, 0.201112, 0.268581, 0.0049757, 1.28505e-12, 0.480883)
('1AQ4', 'RBP', 0.0177232, 0.363746, 0.308995, 0.00169861, 0.0, 0.307837)

My question is: from that iteration, how can I create a numpy.ndarray that looks like this:

array([['1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
       ['1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107],
       ['1AQ3', 'RBP', 0.0444479, 0.201112, 0.268581, 0.0049757, 1.28505e-12, 0.480883],
       ['1AQ4', 'RBP', 0.0177232, 0.363746, 0.308995, 0.00169861, 0.0, 0.307837]])

In the end, the ndarray should have shape (4, 8).

Do you need to have str and float in one array? It can be done with a structured array, but that is not the ideal solution. A normal array allows only one type (its dtype). Have you considered using pandas?
– CT Zhu, Commented Jun 30, 2014 at 3:24
If results is a generator, you will need to convert it to a list first. The reason is that numpy arrays need to know their size at creation time. If you know the number of elements in results, then you can do something like a = numpy.empty((n, 8), dtype='object'), followed by: for i, row in enumerate(results): a[i] = row.
– Alok Singhal, Commented Jun 30, 2014 at 3:37
@AlokSinghal, not entirely true: there is a numpy.fromiter function.
– CT Zhu, Commented Jun 30, 2014 at 3:46
@CTZhu thanks for mentioning that, although it seems like fromiter reallocates the array for every new element unless count is specified. Edit: I just looked at the source code, and it seems to grow the buffer by 50% at each new allocation, so it might not be as bad as I thought.
– Alok Singhal, Commented Jun 30, 2014 at 3:57
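
The preallocation idea from the comments above can be sketched as follows. This is only a minimal illustration: `fetch_rows` is a hypothetical generator standing in for the database cursor (it is not part of MySQLdb), and the row count `n` is assumed to be known in advance for the first option.

```python
import numpy as np

def fetch_rows():
    # Hypothetical stand-in for iterating over a database cursor.
    yield ('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0)
    yield ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107)

# Option 1: the row count is known, so preallocate and fill row by row.
n = 2
filled = np.empty((n, 8), dtype=object)
for i, row in enumerate(fetch_rows()):
    filled[i] = row

# Option 2: the row count is unknown, so materialize the generator first.
from_list = np.array(list(fetch_rows()), dtype=object)

print(filled.shape, from_list.shape)
```

Both options produce a 2-D array of object dtype; the second is simpler when holding all rows in a list is acceptable.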

CT Zhu · Accepted Answer · 2014-06-30 03:38:42Z


Read it into a structured array:

In [30]:
import numpy as np
a=[('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
   ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107),
   ('1AQ3', 'RBP', 0.0444479, 0.201112, 0.268581, 0.0049757, 1.28505e-12, 0.480883),
   ('1AQ4', 'RBP', 0.0177232, 0.363746, 0.308995, 0.00169861, 0.0, 0.307837)]
np.array(a, dtype='S10,S10,f4,f4,f4,f4,f4,f4')
Out[30]:
array([('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
       ('1A9N', 'RBP', 0.045626699924468994, 0.053926799446344376, 0.331932008266449, 0.04640309885144234, 4.413359874888556e-06, 0.5221070051193237),
       ('1AQ3', 'RBP', 0.044447898864746094, 0.20111200213432312, 0.26858100295066833, 0.004975699819624424, 1.2850499744171406e-12, 0.48088300228118896),
       ('1AQ4', 'RBP', 0.01772320084273815, 0.3637459874153137, 0.30899500846862793, 0.0016986100235953927, 0.0, 0.30783700942993164)], 
      dtype=[('f0', 'S10'), ('f1', 'S10'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<f4'), ('f5', '<f4'), ('f6', '<f4'), ('f7', '<f4')])

Alternatively, you can store everything in an array of object dtype:

In [46]:

np.array(a, dtype=object)
Out[46]:
array([['1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
       ['1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031,
        4.41336e-06, 0.522107],
       ['1AQ3', 'RBP', 0.0444479, 0.201112, 0.268581, 0.0049757,
        1.28505e-12, 0.480883],
       ['1AQ4', 'RBP', 0.0177232, 0.363746, 0.308995, 0.00169861, 0.0,
        0.307837]], dtype=object)

but this is not ideal for the float values, and it may lead to undesired behavior:

In [48]:
b=np.array(a, dtype=object)
b[0]+b[1] #addition for float values and concatenation for string values
Out[48]:
array(['1A341A9N', 'RBPRBP', 0.0456267, 1.0539268, 0.331932, 0.0464031,
       4.41336e-06, 0.522107], dtype=object)
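
One way to sidestep the mixed behavior shown above is to slice out the numeric columns and cast them to a float dtype before doing arithmetic. The following is a sketch under the assumption that columns 0 and 1 hold the strings and columns 2 onward hold the floats, as in the sample data; the names `numeric` and `row_sum` are illustrative.

```python
import numpy as np

a = [('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
     ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107)]
b = np.array(a, dtype=object)

# Columns 0 and 1 hold strings; columns 2 onward hold floats.
numeric = b[:, 2:].astype(float)

# Arithmetic on the float array is purely numeric, with no string concatenation.
row_sum = numeric[0] + numeric[1]
print(numeric.dtype, row_sum.shape)
```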

pandas is also an alternative:

In [43]:
import pandas as pd
print(pd.DataFrame(a))
      0    1         2         3         4         5             6         7
0  1A34  RBP  0.000000  1.000000  0.000000  0.000000  0.000000e+00  0.000000
1  1A9N  RBP  0.045627  0.053927  0.331932  0.046403  4.413360e-06  0.522107
2  1AQ3  RBP  0.044448  0.201112  0.268581  0.004976  1.285050e-12  0.480883
3  1AQ4  RBP  0.017723  0.363746  0.308995  0.001699  0.000000e+00  0.307837
In [44]:

pd.DataFrame(a).dtypes
Out[44]:
0     object
1     object
2    float64
3    float64
4    float64
5    float64
6    float64
7    float64
dtype: object

and it allows each column to have its own dtype.
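
If a plain NumPy array is needed later, the numeric columns of the DataFrame can be pulled out again. This is a small sketch, assuming a pandas version that provides `DataFrame.to_numpy` and that the float columns start at position 2, as in the sample data; `X` is an illustrative name.

```python
import pandas as pd

a = [('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
     ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107)]
df = pd.DataFrame(a)

# Select the float columns by position and convert them to a 2-D NumPy array.
X = df.iloc[:, 2:].to_numpy()
print(X.shape, X.dtype)
```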

edited Jun 30, 2014 at 3:38

answered Jun 30, 2014 at 3:31

CT Zhu


5 Comments

neversaint Over a year ago

Thanks for the pandas suggestion. But I need numpy as required by scikit-learn.

CT Zhu Over a year ago

You are welcome. In that case I would recommend encoding the string values as dummy variables or factors (0, 1, 2, 3, ...), so that everything fits into an ordinary numpy array of float dtype.
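
The factor encoding suggested in this comment could look roughly like the sketch below. It uses `np.unique` with `return_inverse=True` as one possible way to map each distinct string to an integer code; that choice, and the names `labels`, `codes`, and `X`, are assumptions made for illustration rather than something stated in the thread. The constant 'RBP' column is left out here for brevity.

```python
import numpy as np

a = [('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
     ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107),
     ('1AQ3', 'RBP', 0.0444479, 0.201112, 0.268581, 0.0049757, 1.28505e-12, 0.480883),
     ('1AQ4', 'RBP', 0.0177232, 0.363746, 0.308995, 0.00169861, 0.0, 0.307837)]

# Map each distinct identifier to an integer code (codes index into labels).
ids = np.array([row[0] for row in a])
labels, codes = np.unique(ids, return_inverse=True)

# Collect the float columns and prepend the integer codes as a float column.
features = np.array([row[2:] for row in a], dtype=float)
X = np.column_stack([codes, features])
print(X.shape, X.dtype)
```

The original strings remain recoverable through `labels[codes]`.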

neversaint Over a year ago

Is the undesired behaviour specific to the object data type? If I hardcode the dtype using your suggestion,

dtype=[('f0', 'S10'), ('f1', 'S10'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<f4'), ('f5', '<f4'), ('f6', '<f4'), ('f7', '<f4')]

that side effect shouldn't occur, right?

neversaint Over a year ago

By the way, the shape is (4,), not (4,8). How can I do it properly to get the latter shape?

CT Zhu Over a year ago

Yep, once you have the data in a structured array, the shape becomes (4,). The 8 disappears, and instead you now have 8 fields, f0 to f7.
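
To make the shape difference concrete, here is a small sketch of how the fields of a structured array can be accessed, and how the numeric fields could be stacked back into a 2-D array. The dtype below uses 'U10' and 'f8' as assumptions suited to Python 3 text and double precision, instead of the 'S10' and 'f4' used in the answer; `numeric` is an illustrative name.

```python
import numpy as np

a = [('1A34', 'RBP', 0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
     ('1A9N', 'RBP', 0.0456267, 0.0539268, 0.331932, 0.0464031, 4.41336e-06, 0.522107)]

# Two text fields followed by six float fields, named f0 to f7.
dt = [('f0', 'U10'), ('f1', 'U10')] + [('f%d' % i, 'f8') for i in range(2, 8)]
s = np.array(a, dtype=dt)

print(s.shape)   # one dimension: one record per row
print(s['f0'])   # a whole column, selected by field name

# Stack the numeric fields side by side to get an ordinary 2-D float array.
numeric = np.column_stack([s[name] for name in s.dtype.names[2:]])
print(numeric.shape)
```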
