SKLearn ValueError: setting an array element with a sequence

Question

As part of a project, I am trying to use the random forest classifier from Python's SKLearn library. I have been using this tutorial as a guide: https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/.

My code follows this tutorial line by line, but the only major difference is the structure of the data. In the tutorial, there are 4 features (4 columns in the data table), and each entry in a column is a number. In my code, I have 1 feature (1 column in the data table), and each entry in a column is a numpy array. When I call the fit() function, I get the following error: ValueError: setting an array element with a sequence.

Here is my code:

import pandas as pd
import numpy as np
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

trainingData = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
vectors_train = []
for i in range (0, len(trainingData)):
    vectors_train.append(trainingData[i][0])

testingData = [[[1, 0, 0], 0.77], [[30, 0, 5], 30], [[0, 0, 0], 0.77], [[0, 0, 0], 0.77]]
vectors_test = []
for i in range (0, len(testingData)):
    vectors_test.append(testingData[i][0])

dataframe_training = pd.DataFrame(trainingData)
dataframe_training['is_train'] = True
dataframe_testing = pd.DataFrame(testingData)
dataframe_testing['is_train'] = False
frames = [dataframe_training, dataframe_testing]
dataframe = pd.concat(frames)
dataframe.rename(index = str, columns = {0: 'Vector', 1: 'Label', 2: 'is_train'})

train, test = dataframe[dataframe['is_train']==True], dataframe[dataframe['is_train']==False]
features = dataframe.columns[:1]
labels_train, uniques = pd.factorize(train[1], sort = True)
clf = RandomForestClassifier()

clf.fit(train[features], labels)              # Value error occurs here

I am confused by what the error actually means. What array element is being set to a sequence, and where is this sequence? I'm also aware that train[features] is a DataFrame object, and that the fit() function takes two parameters, both of which must be array-like. labels is an array, and the error specifically points to the first parameter as the problem, so is there a data type conversion I have to do?

When I replace the line clf.fit(train[features], labels) with clf.fit(vectors_train, labels), the error goes away. However, I want to know why it is not working when I use the same strategy as the tutorial and how to get it to work in a similar fashion.

Any help would be much appreciated. Thanks!

U13-Forward · Accepted Answer · 2019-07-11 00:09:02Z

2

Remove the features variable and make the last line:

clf.fit(train[0].tolist(), labels)

No error raised with the code above.

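Your code isn't working because dataframe.columns[:1] returns a sequence containing one column label, whereas dataframe.columns[0] returns the label itself (an int). With features = dataframe.columns[:1], train[features] is a one-column DataFrame whose cells are lists, and fit() cannot convert that into a numeric array. Even with features = dataframe.columns[0], passing train[features] directly to clf.fit still won't work, because it is a Series of lists and fit() needs a list or array of numbers; in that case train[features].tolist() will also work.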
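To make the explanation above concrete, here is a minimal sketch that reproduces the failing conversion without scikit-learn. It assumes that fit() ends up casting its input to a float array, which is where the message comes from; the variable names (rows, as_frame, cast_failed) are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

# Same layout as the question: column 0 holds a list per row, column 1 a label.
rows = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
train = pd.DataFrame(rows)

# Selecting with a list of columns keeps a DataFrame; each cell is still a list,
# so the underlying array has object dtype and one list per cell.
as_frame = train[[0]].to_numpy()
print(as_frame.shape, as_frame.dtype)

# Casting that object array to float is the step that fails:
# each cell would have to become one number, but it holds a sequence.
try:
    as_frame.astype(float)
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print("ValueError:", err)

# Selecting the single column and calling tolist() gives a list of lists,
# which NumPy can turn into a regular numeric matrix.
X = np.array(train[0].tolist(), dtype=float)
print(X.shape)
```

So the "array element" in the message is one cell of the feature matrix, and the "sequence" is the list stored in that cell.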

2 Comments

Bilal Chandio Over a year ago

This really helped me, thanks. Even though I had a numpy array, it still needed to be treated like a list. I was trying to feed the mean of word2vec embeddings into a classifier, which was throwing this sequence error.

Nakor · Accepted Answer · 2019-07-11 00:08:48Z

0

You get this error because your data is not formatted correctly when you call the fit method. Your input is a one-column DataFrame whose entries are lists, but the fit method expects a two-dimensional numeric array.

It should work if you do instead:

X = np.array(train[features][0].tolist())
clf.fit(X, labels_train)

So X is an array with 4 examples, each with 3 features.
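As a rough end-to-end sketch of the fix above, the following builds X from the question's training data and fits the classifier. The n_estimators and random_state arguments are assumptions added only to keep the example small and repeatable; they are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training rows from the question: [vector, label].
training_data = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
train = pd.DataFrame(training_data)

# Turn the column of lists into a 2-D array: one row per example,
# one column per vector component.
X = np.array(train[0].tolist())

# Encode the label column as integer class ids, as in the question.
labels_train, uniques = pd.factorize(train[1], sort=True)

# Assumed settings, chosen only for a small, repeatable example.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels_train)

predictions = clf.predict(X)
print(X.shape, predictions.shape)
```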

4 Comments

pumpkin39 Over a year ago

Thank you, this works! As a quick follow-up question: suppose I wanted to add another feature - a magnitude. So the first column of the table would be the vector column in my original question, and the second column would be for the magnitude (where each entry is just a number). Would I have to add another line like Y = np.array(train[features][1].tolist()), and what would be the first parameter I pass into the fit() function? I tried adding this line and passing in [X, Y] to fit(), but I got the error: ValueError: could not broadcast input array from shape (80,2) into shape (80).

Nakor Over a year ago

If X and Y are both feature matrices, then you would need to stack them. So if X is of the shape (n_examples, n_features_X) and Y is of the shape (n_examples, n_features_Y), you need to create a new feature matrix of the shape (n_examples, n_features_X+n_features_Y). For example, you could just do Z = np.hstack([X,Y])

pumpkin39 Over a year ago

In my current code, X is of the shape (80, ) and Y is of the shape (80, 2). When I do Z = np.hstack([X,Y]) I get an error: ValueError: all the input arrays must have same number of dimensions. Does this mean that n_features_X must be the same value as n_features_Y for the hstack() function?

Nakor Over a year ago

It's because X is a 1 dimensional vector. Z = np.hstack([X.reshape(-1,1),Y]) should work
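The stacking discussed in these comments can be sketched as follows, using small made-up arrays in place of the (80,) and (80, 2) ones. The values are placeholders; np.column_stack is shown as an alternative that was not mentioned in the thread.

```python
import numpy as np

# Placeholder data: X is 1-D (one value per example), Y is 2-D (two features per example).
X = np.arange(4.0)       # shape (4,)
Y = np.ones((4, 2))      # shape (4, 2)

# hstack refuses to mix a 1-D array with a 2-D array.
try:
    np.hstack([X, Y])
    hstack_failed = False
except ValueError as err:
    hstack_failed = True
    print("ValueError:", err)

# Reshaping X into a single column makes both inputs 2-D with the same row count.
Z = np.hstack([X.reshape(-1, 1), Y])
print(Z.shape)

# column_stack treats 1-D inputs as columns, so no explicit reshape is needed.
Z_alt = np.column_stack([X, Y])
print(np.array_equal(Z, Z_alt))
```

The number of features in X and Y does not need to match; only the number of rows (examples) and the number of dimensions do.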
