
I'm fitting a scikit-learn model (an ExtraTreesRegressor) with the aim of doing supervised feature selection.

I've made a toy example in order to be as clear as possible. Here's the toy code:

import pandas as pd
import numpy as np
from  sklearn.ensemble import ExtraTreesRegressor
from itertools import chain

# Original Dataframe
df = pd.DataFrame({"A": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2] , "CLASS":[1,0]})
X = np.array([np.array(df.A).reshape(1,4) , df.C , df.R])
Y = np.array(df.CLASS)

# prints
X = np.array([np.array(df.A), df.C , df.R])
Y = np.array(df.CLASS)

print("X",X)
print("Y",Y) 
print(df)
df['A'].apply(lambda x: print("ORIGINAL SHAPE",np.array(x).shape,"field:",x))
df['A'] = df['A'].apply(lambda x: np.array(x).reshape(4,1))
df['A'].apply(lambda x: print("RESHAPED SHAPE",np.array(x).shape,"field:",x))
model = ExtraTreesRegressor()
model.fit(X,Y)
model.feature_importances_
X [[[10, 15, 12, 14] [20, 30, 10, 43]]
 [2 2]
 [2 2]]

Y [1 0]

                   A  C  CLASS  R
0  [10, 15, 12, 14]  2      1  2
1  [20, 30, 10, 43]  2      0  2
ORIGINAL SHAPE (4,) field: [10, 15, 12, 14]
ORIGINAL SHAPE (4,) field: [20, 30, 10, 43]

Here's the exception that is raised:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-5a36c4c17ea0> in <module>()
      7 print(df)
      8 model = ExtraTreesRegressor()
----> 9 model.fit(X,Y)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

ValueError: setting an array element with a sequence.

I've noticed that it involves np.arrays. So I tried to fit another toy dataframe, the most basic one, with only scalars, and no errors were raised. Then I kept the same code and modified the same toy dataframe by adding another field containing one-dimensional arrays, and the same exception was raised again.
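For what it's worth, the difference can be reproduced with NumPy alone (a minimal sketch; the column names here are just for illustration). The conversion that sklearn's check_array performs, np.array(X, dtype=...), is exactly the step that fails once a column holds lists:

```python
import numpy as np
import pandas as pd

# Scalar-only columns produce a clean numeric 2-D array
df_ok = pd.DataFrame({"C": [2, 2], "R": [3, 4]})
X_ok = np.array([df_ok.C, df_ok.R]).T
print(X_ok.dtype)   # a numeric dtype, e.g. int64

# A list-valued column forces dtype=object
df_bad = df_ok.assign(A=[[10, 15], [20, 30]])
X_bad = np.array([np.array(df_bad.A), df_bad.C, df_bad.R])
print(X_bad.dtype)  # object

# The float conversion inside check_array is what raises the error
try:
    np.array(X_bad, dtype=np.float32)
except ValueError as e:
    print("ValueError:", e)
```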

I've looked around, but so far I haven't found a solution, even after trying reshapes, conversions into lists, np.array, etc., and matrices in my real problem. I'm still working in this direction.

I've also seen that this kind of error is usually raised when arrays have different lengths between samples, but that is not the case in this toy example.

Does anyone know how to deal with these structures/this exception? Thanks in advance for any help.

  • "A": [[10,15,12,14],[20,30,10,43]], np.array(df.A).reshape(1,4): Reshaping a 2x4 matrix to 1x4? Commented Sep 22, 2016 at 8:53
  • No, originally each row contains a vector: [10,15,12,14] for the first row and [20,30,10,43] for the second one. If I leave the original syntax for scalars, the same exception is raised. Commented Sep 22, 2016 at 8:58
  • Check np.array(df.A).shape, which returns (1,) for a single row and (2,) for two rows. It does not return something like (1, 8) or (2, 4). Commented Sep 22, 2016 at 9:01
  • SHAPE (4,) field: [10, 15, 12, 14] Commented Sep 22, 2016 at 9:02
  • Please be consistent between the body and the comments. Commented Sep 22, 2016 at 9:05

2 Answers


Have a closer look at your X:

>>> X
array([[[10, 15, 12, 14], [20, 30, 10, 43]],
       [2, 2],
       [2, 2]], dtype=object)
>>> type(X[0,0])
<class 'list'>

Notice that it's dtype=object, and one of those objects is a list, hence "setting an array element with a sequence." Part of the problem is that np.array(df.A) does not correctly create a 2-D array:

>>> np.array(df.A)
array([[10, 15, 12, 14], [20, 30, 10, 43]], dtype=object)
>>> _.shape
(2,)  # oops!

But using np.stack(df.A) fixes the problem.
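For instance, np.stack treats each list in the column as a row and yields a proper 2-D numeric array instead of a 1-D object array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [[10, 15, 12, 14], [20, 30, 10, 43]]})

a = np.stack(df.A)  # stacks the per-row lists along a new first axis
print(a.shape)      # (2, 4)
print(a.dtype)      # a numeric dtype, not object
```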

Are you looking for:

>>> X = np.concatenate([
        np.stack(df.A),                 # condense A to (N, 4)
        np.expand_dims(df.C, axis=-1),  # expand C to (N, 1)
        np.expand_dims(df.R, axis=-1),  # expand R to (N, 1)
    ], axis=-1)
>>> X
array([[10, 15, 12, 14,  2,  2],
       [20, 30, 10, 43,  2,  2]], dtype=int64)
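With X built this way, the fit from the question runs as intended. A self-contained sketch using the toy frame from the question:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

df = pd.DataFrame({"A": [[10, 15, 12, 14], [20, 30, 10, 43]],
                   "R": [2, 2], "C": [2, 2], "CLASS": [1, 0]})

X = np.concatenate([
    np.stack(df.A),                 # (2, 4)
    np.expand_dims(df.C, axis=-1),  # (2, 1)
    np.expand_dims(df.R, axis=-1),  # (2, 1)
], axis=-1)                         # (2, 6), numeric dtype
Y = np.array(df.CLASS)

model = ExtraTreesRegressor()
model.fit(X, Y)
print(model.feature_importances_)   # one importance per expanded column
```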

2 Comments

I've just tried, on a similar toy example, fitting with the flattened pca (from [[],..[]] to [...]) and it works. Now I will try the other idea of translating the pandas dataframe into a proper numpy matrix. Thanks a lot for your help!
I have to point out that the correct version of this code should probably be: X = np.concatenate([np.stack(df.flat_pca, axis=0), [df.C1, df.C2]], axis=0).transpose(); otherwise C1 and C2 will be read by columns instead of by rows.

To convert a Pandas DataFrame column to a NumPy matrix:

import numpy as np
import pandas as pd

def df2mat(df):
    a = df.as_matrix()
    n = a.shape[0]
    m = len(a[0])
    b = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            b[i, j] = a[i][j]
    return b

df = pd.DataFrame({"A": [[1,2],[3,4]]})
b = df2mat(df.A)

Then concatenate.
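Putting it together, here is a self-contained sketch of the same idea (substituting to_numpy() for as_matrix(), which newer pandas versions have removed; the "C" column is just an illustrative scalar column):

```python
import numpy as np
import pandas as pd

def df2mat(col):
    """Convert a Series of equal-length lists to a 2-D numeric array."""
    a = col.to_numpy()  # 1-D object array whose elements are lists
    return np.array([np.asarray(row, dtype=float) for row in a])

df = pd.DataFrame({"A": [[1, 2], [3, 4]], "C": [5, 6]})
b = df2mat(df.A)        # shape (2, 2), numeric dtype

# Then concatenate with the scalar column(s)
X = np.concatenate([b, df[["C"]].to_numpy()], axis=1)
print(X)                # [[1. 2. 5.] [3. 4. 6.]]
```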

1 Comment

I'm going in your direction of translating the full dataframe into a numpy structure. I haven't finished yet, so for the moment I can't give you feedback. I'll do so ASAP! However, thanks for your help!
