
I am using Python with Dask to create a logistic regression model, in order to speed up training.

I have x, the feature array (a numpy array), and y, the label vector.

Edit: The numpy arrays are x_train, an (n*m) array of floats, and y_train, an (n*1) vector of integers that are the training labels. Both work fine with sklearn's LogisticRegression.fit.
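
For reference, the working scikit-learn baseline looks roughly like this (a minimal sketch; the ravel() calls are my addition in case y_train is shaped (n, 1)):

from sklearn.linear_model import LogisticRegression

# Plain scikit-learn fit on the numpy arrays, as described above
clf = LogisticRegression()
clf.fit(x_train, y_train.ravel())            # ravel() flattens an (n, 1) label vector to (n,)
print(clf.score(x_train, y_train.ravel()))   # training accuracy as a sanity check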

I tried the following code to create a pandas DataFrame, convert it to a Dask DataFrame, and train on it, as shown here:

from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
import pandas as pd

df = pd.DataFrame(x_train)              # build the pandas DataFrame from the feature array
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd, sd["label"])

But I get an error:

Could not find signature for add_intercept:

I found this issue on GitHub, which suggests using this code instead:

from dask_ml.linear_model import LogisticRegression
from dask import dataframe as dd
import pandas as pd

df = pd.DataFrame(x_train)
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.values, sd["label"])

But I get this error

ValueError: Multiple constant columns detected!

How can I use Dask to train a logistic regression on data that originates from a numpy array?

Thanks.

  • How did you create y_train? How are the corresponding x values created? Please mention that, as it is confusing. Commented Jun 6, 2019 at 8:55
  • x_train is a numpy array of numbers, and y_train is a numpy vector of labels (integers) Commented Jun 6, 2019 at 9:11
  • But you have not mentioned in the question what the contents of x_train and y_train are. Commented Jun 6, 2019 at 9:15
  • @AmazingThingsAroundYou I edited the question, but the origin of the numpy array is not relevant to this question; this is an API issue. Commented Jun 6, 2019 at 9:28

3 Answers


You can bypass the std verification by using:

lr = LogisticRegression(solver_kwargs={"normalize":False})
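
Put together with the snippet from the question, a minimal sketch of that bypass might look like this (building df from x_train and dropping the label column out of the feature matrix are my additions, not part of the original post):

import pandas as pd
from dask import dataframe as dd
from dask_ml.linear_model import LogisticRegression

df = pd.DataFrame(x_train)
df["label"] = y_train
sd = dd.from_pandas(df, npartitions=3)

# solver_kwargs is forwarded to the underlying dask-glm solver;
# per the answer above, normalize=False skips the step that computes the per-column std
lr = LogisticRegression(fit_intercept=False, solver_kwargs={"normalize": False})
lr.fit(sd.drop("label", axis=1).values, sd["label"].values)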

Or you can use @Emptyless's code to get the faulty column_indices and then remove those columns from your array.




This does not seem like an issue with dask_ml. Looking at the source, the std is calculated using:

mean, std = da.compute(X.mean(axis=0), X.std(axis=0))

This means that for every column in your provided array, dask_ml calculates the standard deviation. If the standard deviation of one of those columns is equal to zero (np.where(std == 0)), that column has zero variation.

Including a column with zero variation does not allow any training, ergo it needs to be removed prior to training the model (in a data preparation / cleansing step).

You can quickly find which columns have no variation with the following:

import numpy as np

std = sd.std(axis=0).compute()      # materialize the lazy Dask result
column_indices = np.where(std == 0)
print(column_indices)
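
If you then want to drop those columns and train directly from the numpy arrays, a sketch could look like the following (the np.delete clean-up, the chunk sizes, and the use of dask.array are my own wiring, not from the question):

import numpy as np
import dask.array as da
from dask_ml.linear_model import LogisticRegression

constant_cols = np.where(x_train.std(axis=0) == 0)[0]   # zero-variance feature columns
x_clean = np.delete(x_train, constant_cols, axis=1)     # remove them before training

X = da.from_array(x_clean, chunks=(10000, x_clean.shape[1]))  # chunk size is arbitrary
y = da.from_array(y_train.ravel(), chunks=10000)              # ravel in case y_train is (n, 1)

lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)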

3 Comments

I do think it's a dask issue; you should be able to train a model on data that has a column with zero variance. A lot of model training is based on deep-learning feature extraction, where a column of the data can easily be all 0's.
You can submit an issue or pull request to dask_ml and propose a new keyword argument that would allow this behavior and strip these columns inside the LogisticRegression. I believe, however, that this belongs in the data preparation / cleansing phase. At least this provides some insight into why it fails and how to continue in the meantime.
This is definitely an API issue. Zero variance columns should be ignored during the training process and assigned a coefficient of 0.

A little late to the party, but here I go anyway; hope future readers appreciate it. This answer is for the "Multiple constant columns" error.

A Dask DataFrame is split up into many pandas DataFrames, called partitions. If you set npartitions to 1, it should behave exactly the same as scikit-learn. If you set it to more partitions, the data is split into multiple DataFrames, but I found that this changes the shape of the DataFrames, which in the end resulted in the "Multiple constant columns" error. It can also cause an overflow warning. Unfortunately it is not in my interest to investigate the direct cause of this error; it might simply be that the DataFrame is too large or too small.
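
A minimal sketch of that work-around, reusing df from the question and changing only the npartitions value (as in the earlier sketch, I drop the label column from the feature matrix):

from dask import dataframe as dd
from dask_ml.linear_model import LogisticRegression

# A single partition keeps the underlying shapes identical to the plain pandas case
sd = dd.from_pandas(df, npartitions=1)
lr = LogisticRegression(fit_intercept=False)
lr.fit(sd.drop("label", axis=1).values, sd["label"].values)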

A source for partitioning

Below are the errors, for search engine indexing:

  • ValueError: Multiple constant columns detected!
  • RuntimeWarning: overflow encountered in exp return np.exp(A)

