Converting a data structure of dtype=object to numpy array of dtype=float64

Question

I am trying to convert 'feature1' array from the following data structure into a numpy array so I can input it to sklearn. However, I am running in circles as it always tells me that dtype=object is unsuitable, and I am not able to convert it to the desired float64 format.

I want to extract all the 'feature1' as a list of numpy arrays of dtype=float64, instead of dtype=object from the following structure.

vec is an object returned from an earlier computation.

>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

I tried the following:

>>> t = np.array(list(vec))
>>> t
array([ {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
   ...,
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)

Also,

>>> array = np.array([x['feature1'] for x in vec])

as suggested by another user, gives a similar output:

>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)

I know I can access the contents of 'feature1' using array[i], but what I want is to convert the dtype=object array to dtype=float64, arranged as a list/dict in which each row holds the 'feature1' of the corresponding entry from vec.

I also tried using a pandas dataframe, but to no avail.

    >>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
    >>> pandaseries
0     []
1     []
2     []
3     []
4     []
5     []
6     []
7     []
8     []
9     []
10    []
11    []
12    []
13    []
14    []
...
7021                                                   []
7022    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023                                                   []
7024                                                   []
7025                                                   []
7026                                                   []
7027                                                   []
7028    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029                                                   []
7030    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031                                                   []
7032                                       [2, 2, 0.1, 0]
7033                                                   []
7034                                         [2, 2, 0, 0]
7035                                                   []
Name: feature1, Length: 7036, dtype: object
    >>>

Again, dtype: object is returned. My guess would be to loop over each row and print a list out. But I am unable to do that. Maybe it is a newbie question. What am I doing wrong?

Thanks.

vec contains two dictionaries, and each has a 'feature1' item. Which one do you want?
– wwii, Commented Jun 20, 2015 at 15:48
If you don't know how to access the value of a dictionary item, maybe you should spend some time with The Tutorial in the docs.
– wwii, Commented Jun 20, 2015 at 15:53

hpaulj · Accepted Answer · 2015-06-21 03:55:01Z

2

Let's take as the starting point a list of lists, or equivalently an object array of lists:

A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
A = np.array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)

If the sublists were all the same length, np.array([...]) would give you a 2d array, one row for each sublist, and columns matching their common length. But since they are unequal in length, it can only make it a 1d array, where each element is a pointer to one of these sublists - i.e. dtype=object.
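The difference can be seen in a minimal sketch. The sample values are taken from the question, and the variable names regular and ragged are only illustrative. dtype=object is passed explicitly for the ragged case because recent NumPy releases refuse to infer it from ragged input.

```python
import numpy as np

# Equal-length sublists: NumPy infers a regular 2-D float64 array.
regular = np.array([[2, 2, 2, 0, 0.03333333333333333, 0],
                    [2, 2, 1, 0, 0.5, 0]])
print(regular.shape, regular.dtype)   # (2, 6) float64

# Ragged sublists: only a 1-D array of Python list objects is possible.
# dtype=object is explicit because recent NumPy versions raise an error
# when asked to build an array from ragged input without it.
ragged = np.array([[], [1, 2, 1], [2, 2, 0, 0]], dtype=object)
print(ragged.shape, ragged.dtype)     # (3,) object
```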

I can imagine 2 ways of constructing a 2d array:

pad each sublist to a common length
insert each sublist into an empty array of the appropriate size.

Basically it requires common Python iteration; it's not a common enough task to have a whiz-bang numpy function.

For example:

In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 2, 1, 0],
       [0, 0, 0, 0],
       [2, 2, 0, 0],
       [0, 0, 0, 0]])
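The first option listed above, padding each sublist to a common length, might look like the sketch below. It is an illustration rather than part of the original session: the name padded is made up, and float64 is chosen as the target dtype because the question asks for it. An integer array such as the one built with np.zeros((n,m),int) would truncate fractional values like 0.5.

```python
import numpy as np

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

m = max(len(x) for x in A)               # common row length
# Append zeros to each sublist so every row has exactly m entries.
padded = [list(x) + [0.0] * (m - len(x)) for x in A]
AA = np.array(padded, dtype=np.float64)  # regular 2-D array, one row per sublist
print(AA.shape, AA.dtype)                # (7, 4) float64
```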

To get a sparse matrix:

In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]: 
<7x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

Nothing magical, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.

There is a list-of-lists sparse format that looks a bit like your data.

In [356]: Ml=MA.tolil()

In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)

In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)

Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix, and set its .data attribute directly. But you'd also have to calculate the rows attribute.

You could also look at the data, row, col attributes of the coo format matrix, and decide it would be easy to construct the equivalent from your A list of lists.
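A rough sketch of that idea, building the coo matrix straight from A without a dense intermediate, could look like this. The loop and the names rows, cols and data are illustrative assumptions; the only library call relied on is the documented coo_matrix((data, (row, col)), shape=...) constructor. Zeros inside a sublist are skipped so that they are not stored as explicit entries.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

n = len(A)
m = max(len(x) for x in A)

# Collect one (row, col, value) triple per nonzero entry.
rows, cols, data = [], [], []
for i, x in enumerate(A):
    for j, v in enumerate(x):
        if v != 0:              # skip zeros so they are not stored
            rows.append(i)
            cols.append(j)
            data.append(v)

MA = sparse.coo_matrix((data, (rows, cols)), shape=(n, m), dtype=np.float64)
print(MA.shape, MA.nnz)         # (7, 4) 5
```

Rows shorter than m are implicitly padded with zeros on the right, which matches the dense construction shown earlier.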

One way or another you have to decide how the non-zero rows get padded to the full length.

edited Jun 21, 2015 at 3:55

answered Jun 21, 2015 at 3:03

hpaulj


2 Comments

Shantanu Ghosh Over a year ago

Thanks! Basically this solves my problem. (stackoverflow.com/questions/16600397/…)

Cecil Curry Over a year ago

This is the canonical answer. But wait: that's not all. This answer also serves as a readable introduction to numpy-fication of lists-of-lists (i.e., coercion of ragged Python lists into uniform NumPy matrices) and sparsification of matrices (i.e., coercion of uniform NumPy matrices into their sparse equivalents). In short, this answer exemplifies why I regularly return to StackOverflow. Bravissimo!

iAdjunct · Accepted Answer · 2015-06-20 15:27:18Z

0

This:

array = numpy.array([x['feature1'] for x in vec])

Or you need to be more clear in your example...

answered Jun 20, 2015 at 15:27

iAdjunct


3 Comments

Kasravnd Over a year ago

You must explain your answer. Also, what if the OP wants the 'feature1' values in a separate numpy array?

hpaulj Over a year ago

The OP wants the values as rows of an array. That's what this answer produces. But whether all feature1 values have the same length makes a big difference in the resulting array type.

Shantanu Ghosh Over a year ago

@hpaulj yes the length of each row is different, and i was hoping this would give me a sparse numpy array that would be accessible to scipy.sparse

wwii · Accepted Answer · 2015-06-21 05:30:27Z

0

You can access the value of a dictionary item by using its key:

d = {'a': 1}
d['a']  # --> 1

To access items in a list, you can iterate over it or use its index:

a = [1, 2]

for thing in a:
    print(thing)  # do something with thing

a[0]  # --> 1

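map applies a function to every item of an iterable. In Python 2 it returns a list of the results; in Python 3 it returns a lazy iterator that has to be wrapped in list(). operator.itemgetter returns a function that will retrieve the given item from an object.

import operator
import numpy as np
from bson import ObjectId  # assumes pymongo's bson package; only needed to rebuild the sample data

vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
       {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

# itemgetter('feature1') builds a callable equivalent to lambda d: d['feature1']
feature1 = operator.itemgetter('feature1')
# list() makes this work on Python 3 as well as Python 2
a = np.asarray(list(map(feature1, vec)))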

>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2.          2.          2.          0.          0.03333333  0.        ]
 [ 2.          2.          1.          0.          0.5         0.        ]]
>>> 
>>> for thing in a[1,:]:
...     print type(thing)
...

<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>>
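The (2, 6) float64 result depends on every 'feature1' list having the same length. A quick way to check that before converting is sketched below. The three-record vec is a hypothetical sample that mirrors the question's structure (the 'object_id' and 'vectorized' keys are left out), with one empty list added to show the ragged case.

```python
import numpy as np

# Hypothetical sample mirroring the structure in the question;
# the real records also carry 'object_id' and 'vectorized' keys.
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0]},
    {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0]},
    {'is_Primary': 0, 'feature1': []},
]

# Set of distinct 'feature1' lengths; a single value means the data is regular.
lengths = {len(d['feature1']) for d in vec}
if len(lengths) == 1:
    a = np.array([d['feature1'] for d in vec], dtype=np.float64)
else:
    a = None  # ragged input: pad the rows or build a sparse matrix instead
    print("ragged lengths:", sorted(lengths))  # ragged lengths: [0, 6]
```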

edited Jun 21, 2015 at 5:30

answered Jun 20, 2015 at 16:18

wwii


4 Comments

hpaulj Over a year ago

Maybe it's just a matter of style, but I think [x['feature1'] for x in vec] is more idiomatic than, though functionally the same as, your map(operator...).

wwii Over a year ago

@hpaulj .. List comprehensions are cool and a lot of ppl (most) seem to think they are THE idiomatic form. For some reason I like map - but do use list comprehensions. Sometimes list comprehensions are faster sometimes map is faster (if time is important).

Shantanu Ghosh Over a year ago

@wwii Your suggestion provided a string.

wwii Over a year ago

@ShantanuGhosh - Using the example for vec that was provided, I get an ndarray of shape (2,6) and type float64. Is your data different than the example in the question?
