
I am using Python with Dask to create a logistic regression model, in order to speed up training.

I have x, the feature array (a numpy array), and y, the label vector.

Edit: The numpy arrays are x_train, an (n*m) array of floats, and y_train, an (n*1) vector of integers that are the training labels. Both work fine with sklearn's LogisticRegression.fit.
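
For reference, the working scikit-learn baseline looks roughly like this (a minimal sketch; the ravel() calls are my addition in case y_train is shaped (n, 1)):

from sklearn.linear_model import LogisticRegression

# Plain scikit-learn fit on the numpy arrays, as described above
clf = LogisticRegression()
clf.fit(x_train, y_train.ravel())            # ravel() flattens an (n, 1) label vector to (n,)
print(clf.score(x_train, y_train.ravel()))   # training accuracy as a sanity check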

I tried the following code to create a pandas DataFrame, convert it to a Dask DataFrame, and train on it, as shown here:

from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
import pandas as pd

df = pd.DataFrame(x_train)              # build the pandas DataFrame from the feature array
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])

But I get an error:

Could not find signature for add_intercept:

I found this issue on GitHub, which suggests using this code instead:

from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
import pandas as pd

df = pd.DataFrame(x_train)
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])

But I get this error

ValueError: Multiple constant columns detected!

How can I use Dask to train a logistic regression on data that originates from a numpy array?

Thanks.

  • How did you create y_train? How are the corresponding x values created? Please mention that, as it is confusing. Commented Jun 6, 2019 at 8:55
  • x_train is a numpy array of numbers, and y_train is a numpy vector of labels (integers) Commented Jun 6, 2019 at 9:11
  • But you have not mentioned in the question what the contents of x_train and y_train are. Commented Jun 6, 2019 at 9:15
  • @AmazingThingsAroundYou I edited the question, but the origin of the numpy array is not relevant to this question; this is an API issue. Commented Jun 6, 2019 at 9:28

3 Answers


You can bypass the std verification by using:

lr = LogisticRegression(solver_kwargs={"normalize":False})
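
Put together with the snippet from the question, a minimal sketch of that bypass might look like this (building df from x_train and dropping the label column out of the feature matrix are my additions, not part of the original post):

import pandas as pd
from dask import dataframe as dd
from dask_ml.linear_model import LogisticRegression

df = pd.DataFrame(x_train)
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)

# solver_kwargs is forwarded to the underlying dask-glm solver;
# per the answer above, normalize=False skips the step that computes the per-column std
lr = LogisticRegression(fit_intercept=False, solver_kwargs={"normalize": False})
lr.fit(sd.drop("label", axis=1).values, sd["label"].values)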

Or you can use @Emptyless's code to get the faulty column_indices and then remove those columns from your array.




This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:

mean, std = da.compute(X.mean(axis=0), X.std(axis=0))

This means that for every column in your provided array, dask_ml calculates the standard deviation. If the standard deviation of one of those columns is equal to zero (np.where(std == 0)), that column has zero variation.

Including a column with zero variation does not allow any training, ergo it needs to be removed prior to training the model (in a data preparation / cleansing step).

You can quickly find which columns have no variation with the following:

import numpy as np

std = sd.std(axis=0).compute()      # materialize the lazy Dask result
column_indices = np.where(std == 0)
print(column_indices)
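
If you then want to drop those columns and train directly from the numpy arrays, a sketch could look like the following (the np.delete clean-up, the chunk sizes, and the use of dask.array are my own wiring, not from the question):

import numpy as np
import dask.array as da
from dask_ml.linear_model import LogisticRegression

constant_cols = np.where(x_train.std(axis=0) == 0)[0]   # zero-variance feature columns
x_clean = np.delete(x_train, constant_cols, axis=1)     # remove them before training

X = da.from_array(x_clean, chunks=(10000, x_clean.shape[1]))  # chunk size is arbitrary
y = da.from_array(y_train.ravel(), chunks=10000)              # ravel in case y_train is (n, 1)

lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)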

3 Comments

I do think it's a dask issue; you should be able to train a model on data that has a column with zero variance. A lot of model training is based on deep-learning feature extraction, where a column of the data can easily be all 0's.
You can submit an issue or pull request to dask_ml and propose a new keyword argument that would allow this behavior and strip these columns inside the LogisticRegression. I believe, however, that this belongs in the data preparation / cleansing phase. At least this provides some insight into why it fails and how to continue in the meantime.
This is definitely an API issue. Zero variance columns should be ignored during the training process and assigned a coefficient of 0.

A little late to the party, but here I go anyway; hope future readers appreciate it. This answer is for the "Multiple constant columns" error.

A Dask DataFrame is split up into many pandas DataFrames, called partitions. If you set npartitions to 1, it should behave exactly the same as scikit-learn. If you set it to more partitions, the data is split into multiple DataFrames, but I found that this changes the shape of the DataFrames, which in the end resulted in the "Multiple constant columns" error. It can also cause an overflow warning. Unfortunately it is not in my interest to investigate the direct cause of this error; it might simply be that the DataFrame is too large or too small.
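
A minimal sketch of that work-around, reusing df from the question and changing only the npartitions value (as in the earlier sketch, I drop the label column from the feature matrix):

from dask import dataframe as dd
from dask_ml.linear_model import LogisticRegression

# A single partition keeps the underlying shapes identical to the plain pandas case
sd = dd.from_pandas(df, npartitions=1)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.drop("label", axis=1).values, sd["label"].values)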

A source for partitioning

Below are the errors, for search engine indexing:

  • ValueError: Multiple constant columns detected!
  • RuntimeWarning: overflow encountered in exp return np.exp(A)

