I am new to machine learning and am trying to apply logistic regression to a sample data set. I have a single feature that contains a list of numbers, and I want to predict a class.

The following is my code:

from sklearn.linear_model import LogisticRegression
a = [[1,2,3], [1,2,3,4,5,6], [4,5,6,7], [0,0,0,7,1,2,3]]
b = [0,1,0, 0]
p = [[9,0,2,4]]

clfModel1 = LogisticRegression(class_weight='balanced')
clfModel1.fit(a,b)
clfModel1.predict(p)

I am getting the following error:

Traceback (most recent call last):
  File "F:\python_3.4\NLP\t.py", line 7, in <module>
    clfModel1.fit(a,b)
  File "C:\Python34\lib\site-packages\sklearn\linear_model\logistic.py", line 1173, in fit
    order="C")
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
>>>

Is there some way to change the data so that I can apply the classifier and predict the results?

  • Your a is not a valid input: it is a staggered "matrix". In logistic regression, each feature needs to be a number, not a list. How is this supposed to work with logistic regression? Commented Aug 3, 2017 at 19:05
  • I thought the same thing. Is there a way around it? Please help. Commented Aug 3, 2017 at 19:09
  • This sounds more like a question for CrossValidated. Commented Aug 3, 2017 at 19:11
  • Fundamentally, the fix needed to make the LogisticRegression.fit method work is to make all the sublists in a the same size. However, without understanding your data there is no way to say how you could do that in a valid and useful way. Commented Aug 3, 2017 at 19:13

1 Answer

Logistic regression is an estimator for functions of the form:

R^d -> [0,1]

But your data is clearly not a subset of R^d: each sample in a has a different length (number of dimensions), so logistic regression cannot be applied to it.
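
A quick way to see the problem is to collect the sample lengths. This is only a sketch using the lists from the question; the names lengths and is_rectangular are introduced here for illustration.

```python
# The sample lists from the question.
a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]

# A valid design matrix has exactly one row length d.
lengths = sorted({len(sample) for sample in a})
is_rectangular = len(lengths) == 1

print(lengths, is_rectangular)  # [3, 4, 6, 7] False
```

Four different lengths means there is no single d, which is why the conversion to an array inside fit fails.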

Another problem is that p should be a list of samples too, not a single sample (and each of those samples has to have d dimensions too, of course).
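
To illustrate the shape that predict expects, here is a minimal sketch on made-up fixed-length data. The arrays X and y and the choice d = 3 are assumptions for the example, not the data from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical fixed-length data: every sample has d = 3 features.
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 0]])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)

single = [9, 0, 2]           # one sample with d = 3 values
batch = np.array([single])   # a list of one sample, shape (1, 3)
pred = clf.predict(batch)    # one prediction per row of batch

print(batch.shape, pred.shape)
```

Passing single directly would be rejected, since predict expects a 2D array with one row per sample.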

There is no "way around this"; it is simply a wrong idea. Typical solutions for working with "odd" data are:

  • Predefine your own custom mapping (a feature extraction step) that, given a variable-length point, outputs a fixed-length representation (that is, d numbers). There is no general way of doing that; everything depends on the data.
  • Use a model that can deal with variable-length inputs, such as an LSTM, though it is a huge jump from logistic regression to recurrent neural nets.
  • Use methods that are similarity-based (like kNN) and simply define your own measure of what it means for two "lists of numbers" to be similar.
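
As a rough sketch of the first option, the snippet below maps each list to four summary numbers before fitting. The function extract_features and the choice of length, mean, minimum and maximum are purely illustrative assumptions; whether such features carry any signal depends entirely on what the lists represent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(seq):
    """Map a variable-length list to d = 4 numbers (an illustrative choice)."""
    arr = np.asarray(seq, dtype=float)
    return [len(arr), arr.mean(), arr.min(), arr.max()]

# Data from the question.
a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]
b = [0, 1, 0, 0]
p = [[9, 0, 2, 4]]

# Every row now has the same length, so this is a proper (n, d) matrix.
X = np.array([extract_features(s) for s in a])
X_new = np.array([extract_features(s) for s in p])

clf = LogisticRegression(class_weight="balanced")
clf.fit(X, b)
pred = clf.predict(X_new)

print(X.shape, X_new.shape)
```

Note that this mapping assumes non-empty lists, and with only four training samples the fitted model is not meaningful; the point is only that fit and predict now accept the input.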

There is no other way: either rethink the representation of your data, or change the approach.
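
For the similarity-based option listed above, one possible sketch uses scikit-learn's KNeighborsClassifier with metric="precomputed", so that the distance between two lists can be any function you define. The function list_distance (a Jaccard distance on the sets of values) is a hypothetical choice made up for this example, not a recommendation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def list_distance(u, v):
    """Hypothetical measure: Jaccard distance between the sets of values."""
    su, sv = set(u), set(v)
    union = su | sv
    if not union:
        return 0.0  # two empty lists are treated as identical
    return 1.0 - len(su & sv) / len(union)

# Data from the question; the lists may have any length.
a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]
b = [0, 1, 0, 0]
p = [[9, 0, 2, 4]]

# Pairwise distances: train vs. train for fit, new vs. train for predict.
D_train = np.array([[list_distance(u, v) for v in a] for u in a])
D_new = np.array([[list_distance(q, v) for v in a] for q in p])

knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(D_train, b)
pred = knn.predict(D_new)

print(D_train.shape, D_new.shape)
```

The classifier never sees the ragged lists themselves, only the distance matrices, so the variable lengths are no longer a problem. How well it works depends on whether the chosen distance reflects what "similar" means for your data.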

2 Comments

I get your point. I have changed p as you suggested, but there is still no way to pass staggered data as input to the classifier. Thanks.
Indeed, as described in the first part of the answer, LR cannot be applied to such data. The issue with p was a minor thing.
