Linear regression using sklearn array issue

Question

I'm trying to set up a simple linear regression test based on the following example.

Here is my code:

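import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Normalize customer data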
x_array = np.array(CustomerRFM['recency'])
normalized_X = preprocessing.normalize([x_array])
y_array = np.array(CustomerRFM['monetary_value'])
normalized_Y = preprocessing.normalize([y_array])

print('normalized_X: ' + str(np.count_nonzero(normalized_X)))
print('normalized_Y: ' + str(np.count_nonzero(normalized_Y)))

X_train, X_test = train_test_split(normalized_X, test_size=0.2)
Y_train, Y_test = train_test_split(normalized_Y, test_size=0.2)

print('X_train: ' + str(np.count_nonzero(X_train)))
print('Y_train: ' + str(np.count_nonzero(Y_train)))

regr = LinearRegression()
regr.fit(X_train, Y_train)

I have added the four print() lines as I am getting a strange issue. The console print of these four lines is:

normalized_X: 4304
normalized_Y: 4338
X_train: 0
Y_train: 0

For some reason, when I split the data into training and test sets, the training arrays come back with no values.

I get the following error on the regr.fit() line:

ValueError: Found array with 0 sample(s) (shape=(0, 4339)) while a minimum of 1 is required.

This tells me there is something wrong with the X values, but I don't know what.

UPDATE: Change to print(array.shape)

If I change my code to use

print('normalized_X: ' + str(normalized_X.shape))
print('normalized_Y: ' + str(normalized_Y.shape))

and this:

print('X_train: ' + str(X_train.shape))
print('Y_train: ' + str(Y_train.shape))

I get:

normalized_X: (1, 4339)
normalized_Y: (1, 4339)

and this:

X_train: (0, 4339)
Y_train: (0, 4339)

Before counting the nonzero values, did you just print(X_train) and print(Y_train) to see what's inside?
– Sheldore, Commented Jan 1, 2019 at 18:45
More helpful than print(np.count_nonzero(array)) would be print(array.shape). count_nonzero will flatten dimensions and ignore zero values, two features that are counterproductive here. shape is where a lot of tricky, exciting things happen.
– waterproof, Commented Jan 1, 2019 at 18:54
But I don't understand why, as both normalized X and Y have data.
– Silentbob, Commented Jan 1, 2019 at 18:54
X_train, X_test = train_test_split(np.transpose(normalized_X), test_size=0.2); Y_train, Y_test = train_test_split(np.transpose(normalized_Y), test_size=0.2)
– Ankur Goel, Commented Jan 1, 2019 at 19:01

waterproof · Accepted Answer · 2019-01-01 19:04:48Z

1

It looks like you're using preprocessing.normalize incorrectly. By wrapping x_array in square brackets, as in [x_array], you're creating an array of shape (1, 4339).

According to the docs, preprocessing.normalize expects an array of shape [n_samples, n_features]. In your example, n_samples is 1 and n_features is 4339, which I don't think is what you want! You're then asking train_test_split to split a data set containing a single sample, so it understandably returns an empty training set: with test_size=0.2, the one sample ends up in the test set and nothing is left for training.
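One possible way to restructure the code is sketched below. It is a sketch under assumptions rather than a drop-in fix: the synthetic x_array and y_array stand in for CustomerRFM['recency'] and CustomerRFM['monetary_value'], which are not available here, and random_state=0 is added only for reproducibility. Two choices are mine and not part of the original question: normalize is called with axis=0 so that the single feature column is scaled as a whole (the default axis=1 would scale each one-value row to unit norm), and X and y are split in a single train_test_split call so that their rows stay aligned.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the CustomerRFM columns (assumption, not the real data)
rng = np.random.default_rng(0)
x_array = rng.uniform(1, 365, size=100)
y_array = 3.0 * x_array + rng.normal(0, 10, size=100)

# Wrapping the 1-D array in a list yields ONE sample with 100 features
wrapped_shape = np.array([x_array]).shape

# Reshape to (n_samples, 1) so each value is its own sample;
# axis=0 normalizes down the feature column instead of across each row
X = preprocessing.normalize(x_array.reshape(-1, 1), axis=0)
y = preprocessing.normalize(y_array.reshape(-1, 1), axis=0).ravel()

# Split X and y together so the same rows land in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regr = LinearRegression()
regr.fit(X_train, y_train)

print(wrapped_shape, X_train.shape, y_train.shape)
```

With 100 samples and test_size=0.2, the training arrays should hold 80 rows and the test arrays 20, and regr.fit no longer sees an empty array.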

