17

I tried this but couldn't get it to work for my data: Use Scikit Learn to do linear regression on a time series pandas data frame

My data consists of 2 DataFrames: DataFrame_1.shape = (40,5000) and DataFrame_2.shape = (40,74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2,74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506,13) and (506,).

Here's my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]  # one target column at a time; may contain NaN
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the format above so that the shapes of the matrices match up.

If posting the DataFrame_2 would help, please comment below and I'll add it.

2 Answers

8

You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

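# sklearn.preprocessing.Imputer has been removed from recent scikit-learn; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()  # default strategy is "mean"
# fit_transform expects 2-D input, so reshape the single column to (n_samples, 1)
y_imputed = imputer.fit_transform(y.to_numpy().reshape(-1, 1)).ravel()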
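As a rough sketch of how the imputation step could slot into the question's per-column loop: the frames below are synthetic stand-ins (smaller than the real ones, with NaNs scattered at random), `SimpleImputer` is used as the current replacement for `Imputer`, and the `models` dict is just an illustrative way to keep one fitted model per target column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the question's frames (assumed shapes, shrunk for speed)
DataFrame_1 = pd.DataFrame(rng.normal(size=(40, 50)))
DataFrame_2 = pd.DataFrame(rng.normal(size=(40, 5)))
# Knock out roughly 20% of the target values to mimic missing data
DataFrame_2 = DataFrame_2.mask(rng.random(DataFrame_2.shape) < 0.2)

X = DataFrame_1.to_numpy()
models = {}
for col in DataFrame_2.columns:
    # SimpleImputer wants 2-D input, so reshape the column to (n_samples, 1)
    y = DataFrame_2[col].to_numpy().reshape(-1, 1)
    y_imputed = SimpleImputer(strategy="mean").fit_transform(y).ravel()
    models[col] = LinearRegression().fit(X, y_imputed)
```

Each entry in `models` is then a fitted `LinearRegression` for one target column. Note that mean-filling the target only makes the fit run; whether it is statistically appropriate depends on the data.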

Otherwise, you might want to build your model using a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
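One way to find such a subset is to measure the share of missing values per column and keep only the columns below some cutoff. This is a minimal sketch on made-up data; the `targets` frame, its column names, and the `max_nan_share` threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical frame: 40 rows, 6 columns named t0..t5
targets = pd.DataFrame(rng.normal(size=(40, 6)),
                       columns=[f"t{i}" for i in range(6)])
targets.loc[:30, "t0"] = np.nan  # .loc slicing is inclusive: 31 of 40 rows missing
targets.loc[:3, "t1"] = np.nan   # 4 of 40 rows missing

# Fraction of NaN values in each column
nan_share = targets.isna().mean()

max_nan_share = 0.25  # assumed cutoff; tune for your data
usable = nan_share[nan_share <= max_nan_share].index.tolist()
print(usable)
```

Here `t0` is dropped because most of its values are missing, while `t1` stays because only a tenth of its values are.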


8 Comments

I tried this with my column and got TypeError: unbound method fit_transform() must be called with Imputer instance as first argument (got Series instance instead) then tried it with the whole DataFrame and got the same thing (w/ DataFrame instead of Series)
With scikit-learn you need to call things on the underlying numpy arrays, not the DataFrames themselves; you should have already set X = DataFrame_1.values and y = DataFrame_2.values
Oops, I also gave you the wrong syntax for the imputer; I've fixed the code
What I don't understand is why y_imputed returns a numpy array that differs in length from the original length of the column in the original data frame? The non-imputed numbers are included, so it's not a matter of showing "imputed only". What gives?
Imputer was deprecated and is no longer available as of sklearn 0.23.2; use sklearn.impute.SimpleImputer instead
4

If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.

df.fillna(df.mean(), inplace=True)
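A small sketch of what this does, on a toy frame invented for illustration. It uses the non-inplace form so the original frame is kept for comparison, and checks the result against scikit-learn's `SimpleImputer` with the mean strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): one NaN in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 4.0, 8.0]})

# df.mean() skips NaN, so each gap gets the mean of the observed values in its column
filled = df.fillna(df.mean())

# Same imputation done through scikit-learn, returned as a numpy array
sk_filled = SimpleImputer(strategy="mean").fit_transform(df)

print(filled)
print(np.allclose(filled.to_numpy(), sk_filled))
```

The gap in `a` becomes 2.0 (mean of 1 and 3) and the gap in `b` becomes 6.0 (mean of 4 and 8), matching the scikit-learn result.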

1 Comment

Yes! Mean imputation is also the default strategy of the sklearn imputer
