As part of a project, I am trying to use the random forest classifier from Python's SKLearn library. I have been using this tutorial as a guide: https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/.

My code follows this tutorial line by line, but the only major difference is the structure of the data. In the tutorial, there are 4 features (4 columns in the data table), and each entry in a column is a number. In my code, I have 1 feature (1 column in the data table), and each entry in a column is a numpy array. When I call the fit() function, I get the following error: ValueError: setting an array element with a sequence.

Here is my code:

import pandas as pd
import numpy as np
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

trainingData = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
vectors_train = []
for i in range (0, len(trainingData)):
    vectors_train.append(trainingData[i][0])

testingData = [[[1, 0, 0], 0.77], [[30, 0, 5], 30], [[0, 0, 0], 0.77], [[0, 0, 0], 0.77]]
vectors_test = []
for i in range (0, len(testingData)):
    vectors_test.append(testingData[i][0])

dataframe_training = pd.DataFrame(trainingData)
dataframe_training['is_train'] = True
dataframe_testing = pd.DataFrame(testingData)
dataframe_testing['is_train'] = False
frames = [dataframe_training, dataframe_testing]
dataframe = pd.concat(frames)
dataframe.rename(index = str, columns = {0: 'Vector', 1: 'Label', 2: 'is_train'})

train, test = dataframe[dataframe['is_train']==True], dataframe[dataframe['is_train']==False]
features = dataframe.columns[:1]
labels_train, uniques = pd.factorize(train[1], sort = True)
labels = labels_train                         # the fit() call below refers to the encoded labels as `labels`
clf = RandomForestClassifier()

clf.fit(train[features], labels)              # ValueError occurs here

I am confused about what the error actually means. Which array element is being set to a sequence, and where is this sequence? I'm also aware that train[features] is a DataFrame object, and that the fit() function takes two parameters, both of which must be array-like. labels is an array, and the error specifically points to the first parameter being the problem, so is there a data type conversion I have to do?

When I replace the line clf.fit(train[features], labels) with clf.fit(vectors_train, labels), the error goes away. However, I want to know why it is not working when I use the same strategy as the tutorial and how to get it to work in a similar fashion.

Any help would be much appreciated. Thanks!

2 Answers

2

Remove the features variable and make the last line:

clf.fit(train[0].tolist(), labels)

No error raised with the code above.

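Your code fails because dataframe.columns[:1] returns an Index holding one column label, so train[features] is a one-column DataFrame whose cells are lists. dataframe.columns[0] returns the single label 0 instead, and train[0] is a Series. Passing that Series straight to clf.fit still won't work, because fit needs a list or array of numeric rows rather than a Series of list objects. Converting it with .tolist() fixes that, so with features = dataframe.columns[0], train[features].tolist() will also work.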
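To make the difference concrete, here is a minimal sketch using the training rows from the question. The names df, by_slice, by_label, and rows are only illustrative and are not part of the original code.

```python
import pandas as pd

# Training rows from the question: [vector, label]
df = pd.DataFrame([[[0, 0, 3], 0.77], [[24, 0, 5], 30],
                   [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]])

# Slicing the column index keeps an Index, so selecting with it gives a DataFrame
by_slice = df[df.columns[:1]]
print(type(by_slice).__name__, by_slice.shape)

# Indexing the column index gives one label, so selecting with it gives a Series
by_label = df[df.columns[0]]
print(type(by_label).__name__, by_label.shape)

# tolist() turns the Series of lists into a plain list of rows
rows = by_label.tolist()
print(rows)
```

The list of rows is what scikit-learn can turn into a 2D numeric array; the one-column DataFrame of lists is not.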

2 Comments

This really helped me, thanks. Even though I had a NumPy array, it still needed to be treated like a list. I was trying to feed the mean of word2vec embeddings into a classifier, which was throwing this sequence error.
0

You get this error because your data is not in the right format when you call the fit method. Your input is a one-column DataFrame whose cells are lists, but fit expects a 2D numeric array of shape (n_samples, n_features).
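As a rough illustration of where the "sequence" comes from, the sketch below reproduces the failing conversion with NumPy alone. The names df and raw are hypothetical, and it assumes the conversion inside fit behaves like a cast of the object array to float.

```python
import numpy as np
import pandas as pd

# One column whose cells are lists, like the 'Vector' column in the question
df = pd.DataFrame({0: [[0, 0, 3], [24, 0, 5], [0, 0, 4], [0, 0, 0]]})

# The DataFrame becomes a (4, 1) object array: each element is a whole list
raw = df[[0]].to_numpy()
print(raw.dtype, raw.shape)

# Casting to float tries to store a list (a sequence) in a single float slot
try:
    raw.astype(float)
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)
print(conversion_error)
```

Each array element would have to hold a full three-item list, which is the sequence the error message refers to.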

It should work if you do instead:

X = np.array(train[features][0].tolist())
clf.fit(X, labels_train)

So X is an array with 4 examples, each with 3 features.
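A self-contained version of this fix might look like the sketch below. It uses the training rows from the question; n_estimators=10 and random_state=0 are arbitrary choices for the example, not values from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training rows from the question: [vector, label]
train = pd.DataFrame([[[0, 0, 3], 0.77], [[24, 0, 5], 30],
                      [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]])

# Unpack the column of lists into a 2D numeric array
X = np.array(train[0].tolist())
print(X.shape)

# Encode the labels as integer codes, as in the question
labels_train, uniques = pd.factorize(train[1], sort=True)

# Arbitrary settings, chosen only to keep the example small and repeatable
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, labels_train)
predictions = clf.predict(X)
```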

4 Comments

Thank you, this works! As a quick follow-up question: suppose I wanted to add another feature - a magnitude. So the first column of the table would be the vector column in my original question, and the second column would be for the magnitude (where each entry is just a number). Would I have to add another line like Y = np.array(train[features][1].tolist()), and what would be the first parameter I pass into the fit() function? I tried adding this line and passing in [X, Y] to fit(), but I got the error: ValueError: could not broadcast input array from shape (80,2) into shape (80).
If X and Y are both feature matrices, then you would need to stack them. So if X is of the shape (n_examples, n_features_X) and Y is of the shape (n_examples, n_features_Y), you need to create a new feature matrix of the shape (n_examples, n_features_X+n_features_Y). For example, you could just do Z = np.hstack([X,Y])
In my current code, X is of the shape (80, ) and Y is of the shape (80, 2). When I do Z = np.hstack([X,Y]) I get an error: ValueError: all the input arrays must have same number of dimensions. Does this mean that n_features_X must be the same value as n_features_Y for the hstack() function?
It's because X is a 1 dimensional vector. Z = np.hstack([X.reshape(-1,1),Y]) should work
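For reference, a small sketch of the stacking step with placeholder data of the shapes mentioned in the comments (the values are made up, only the shapes matter):

```python
import numpy as np

# Placeholder data with the shapes from the comments above
X = np.arange(80, dtype=float)   # shape (80,), one value per example
Y = np.ones((80, 2))             # shape (80, 2), two values per example

# hstack needs matching dimensions, so turn X into a column of shape (80, 1)
Z = np.hstack([X.reshape(-1, 1), Y])
print(Z.shape)

# np.column_stack treats 1-D inputs as columns, so it needs no reshape
Z_alt = np.column_stack([X, Y])
```

The number of features in X and Y does not have to match; only the number of rows and the number of dimensions do.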
