0

I am trying to convert the 'feature1' arrays from the following data structure into a numpy array so I can feed them to sklearn. However, I am going in circles: it always tells me that dtype=object is unsuitable, and I cannot convert it to the desired float64 format.

I want to extract all the 'feature1' as a list of numpy arrays of dtype=float64, instead of dtype=object from the following structure.

vec is an object returned from an earlier computation.

>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

I tried the following:

>>> t = np.array(list(vec))
>>> t
array([ {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
   ...,
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)

Also,

>>> array = np.array([x['feature1'] for x in vec])

as suggested by another user, gives a similar output:

>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)

I know I can access the contents of 'feature1' using array[i], but what I want is to convert the dtype=object array to dtype=float64, as a list/dict in which each row holds the 'feature1' of the corresponding entry from vec.

I also tried using a pandas dataframe, but to no avail.

    >>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
    >>> pandaseries
0     []
1     []
2     []
3     []
4     []
5     []
6     []
7     []
8     []
9     []
10    []
11    []
12    []
13    []
14    []
...
7021                                                   []
7022    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023                                                   []
7024                                                   []
7025                                                   []
7026                                                   []
7027                                                   []
7028    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029                                                   []
7030    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031                                                   []
7032                                       [2, 2, 0.1, 0]
7033                                                   []
7034                                         [2, 2, 0, 0]
7035                                                   []
Name: feature1, Length: 7036, dtype: object
    >>> 

Again, dtype: object is returned. My guess would be to loop over each row and print a list out. But I am unable to do that. Maybe it is a newbie question. What am I doing wrong?

Thanks.

3
  • vec contains two dictionaries, each has a 'feature1' item. Which one do you want? Commented Jun 20, 2015 at 15:48
  • If you don't know how to access the value of a dictionary item, maybe you should spend some time with The Tutorial in the docs. Commented Jun 20, 2015 at 15:53
  • I want both as the rows of a numpy array. Commented Jun 20, 2015 at 15:54

3 Answers

2

Let's take as the starting point a list of lists or, equivalently, an object array of lists:

A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
# or, equivalently, as an object array (assumes import numpy as np):
A = np.array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)

If the sublists were all the same length, np.array([...]) would give you a 2d array, one row for each sublist, and columns matching their common length. But since they are unequal in length, it can only make it a 1d array, where each element is a pointer to one of these sublists - i.e. dtype=object.
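
As a small sketch of that difference (the variable names here are made up for illustration, and the equal-length rows are taken from the question's vec example):

```python
import numpy as np

# Equal-length sublists: NumPy builds a regular 2-D float array.
equal = [[2, 2, 2, 0, 0.03333333333333333, 0],
         [2, 2, 1, 0, 0.5, 0]]
regular = np.array(equal, dtype=np.float64)
print(regular.shape, regular.dtype)   # (2, 6) float64

# Unequal-length sublists: the best NumPy can do is a 1-D array
# whose elements are the Python lists themselves.
ragged = np.array([[], [1, 2, 1], [2, 2, 0, 0]], dtype=object)
print(ragged.shape, ragged.dtype)     # (3,) object
```

Passing dtype=object explicitly matters on recent NumPy versions, which refuse to build a ragged array implicitly.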

I can imagine 2 ways of constructing a 2d array:

  • pad each sublist to a common length
  • insert each sublist into an empty array of the appropriate size.

Basically, it requires ordinary Python iteration; it's not a common enough task to have a whiz-bang numpy function.

For example:

In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 2, 1, 0],
       [0, 0, 0, 0],
       [2, 2, 0, 0],
       [0, 0, 0, 0]])
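
The other option, padding each sublist to a common length, might look roughly like this. It is only a sketch: zero is an assumed fill value, float64 is chosen because that is what the question asks for, and it assumes A is not empty.

```python
import numpy as np

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

# Length of the longest sublist (max() would fail on an empty A).
m = max(len(x) for x in A)

# Pad every sublist on the right with zeros up to length m.
padded = [list(x) + [0.0] * (m - len(x)) for x in A]

# All rows now have the same length, so this is a regular 2-D array.
AA = np.array(padded, dtype=np.float64)
print(AA.shape, AA.dtype)   # (7, 4) float64
```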

To get a sparse matrix:

In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]: 
<7x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

Nothing magical, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.

There is a list-of-lists sparse format that looks a bit like your data.

In [356]: Ml=MA.tolil()

In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)

In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)

Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix, and set its .data attribute directly. But you'd also have to calculate the rows attribute.

You could also look at the data, row, col attributes of the coo format matrix, and decide it would be easy to construct the equivalent from your A list of lists.
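
One way that direct construction could go, skipping the dense array entirely, is sketched below. It assumes values are left-aligned (column j is simply the position within the sublist) and that explicit zeros in a sublist should not be stored; the names rows, cols and data are just illustrative.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]
n = len(A)
m = max(len(x) for x in A)

# Collect (row, col, value) triplets for the non-zero entries only.
rows, cols, data = [], [], []
for i, x in enumerate(A):
    for j, v in enumerate(x):
        if v != 0:
            rows.append(i)
            cols.append(j)
            data.append(v)

# coo_matrix accepts (data, (row, col)) plus an explicit shape.
M = sparse.coo_matrix((data, (rows, cols)), shape=(n, m), dtype=np.float64)
print(M.shape, M.nnz)   # (7, 4) 5
```

M.tocsr() then gives a format that sklearn estimators generally accept.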

One way or another, you have to decide how the non-zero rows get padded to the full length.

2 Comments

Thanks! Basically this solves my problem. (stackoverflow.com/questions/16600397/…)
This is the canonical answer. But wait: that's not all. This answer also serves as a readable introduction to numpy-fication of lists-of-lists (i.e., coercion of ragged Python lists into uniform NumPy matrices) and sparsification of matrices (i.e., coercion of uniform NumPy matrices into their sparse equivalents). In short, this answer exemplifies why I regularly return to StackOverflow. Bravissimo!
0

This:

array = numpy.array([x['feature1'] for x in vec])

Or you need to be more clear in your example...

3 Comments

You must explain your answer. Also, what about the OP wanting the 'feature1' values in a separate numpy array?
The OP wants the values as rows of an array. That's what this answer produces. But whether all feature1 values have the same length makes a big difference in the resulting array type.
@hpaulj Yes, the length of each row is different, and I was hoping this would give me a sparse numpy array that would be accessible to scipy.sparse.
0

You can access the value of a dictionary item by using its key:

d = {'a': 1}
d['a']  # --> 1

To access items in a list, you can iterate over it or use an index:

a = [1, 2]

for thing in a:
    pass  # do something with thing

a[0]  # --> 1

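map conveniently applies a function to all the items of an iterable. In Python 2 it returns a list of the results; in Python 3 it returns an iterator, so wrap it in list(). operator.itemgetter returns a function that will retrieve an item from an object.

import operator
import numpy as np
# itemgetter('feature1') builds a callable equivalent to lambda d: d['feature1']
feature1 = operator.itemgetter('feature1')
a = np.asarray(list(map(feature1, vec)))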

vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
       {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2.          2.          2.          0.          0.03333333  0.        ]
 [ 2.          2.          1.          0.          0.5         0.        ]]
>>> 
>>> for thing in a[1,:]:
    print type(thing)

<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>> 
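
For anyone on Python 3, the same idea could be sketched as below. This is an adaptation, not the session above: the object_id entries are dropped so the snippet runs without bson, get_feature1 is an illustrative name, and dtype=np.float64 is requested explicitly.

```python
import operator
import numpy as np

# Same shape of data as in the question, minus the ObjectId fields.
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'vectorized': 1},
]

# itemgetter('feature1') returns a callable that does d['feature1'].
get_feature1 = operator.itemgetter('feature1')

# map() is lazy in Python 3, so materialise it with list() first.
a = np.asarray(list(map(get_feature1, vec)), dtype=np.float64)
print(a.shape, a.dtype)   # (2, 6) float64
```

This only yields a 2-D float array because both 'feature1' lists have the same length; with rows of different lengths the conversion to float64 fails.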

4 Comments

Maybe it's just a matter of style, but I think [x['feature1'] for x in vec] is more idiomatic than, though functionally the same as, your map(operator...).
@hpaulj .. List comprehensions are cool and a lot of ppl (most) seem to think they are THE idiomatic form. For some reason I like map - but do use list comprehensions. Sometimes list comprehensions are faster sometimes map is faster (if time is important).
@wwii Your suggestion provided a string.
@ShantanuGhosh - Using the example for vec that was provided, I get an ndarray of shape (2,6) and type float64. Is your data different than the example in the question?
