Handling missing values with linear regression

Question

I am trying to handle missing values in one of the columns of my dataset with linear regression.

The name of the column is "Landsize" and I am trying to predict NaN values with linear regression using several other variables.

Here is the linear regression code:

# Import pandas and the regression model
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the dataset
dataset = pd.read_csv('real_estate.csv')

linreg = LinearRegression()
data = dataset[['Price','Rooms','Distance','Landsize']]
# Step 1: Rows with a known Landsize form the training set; rows with a missing Landsize form the test set.
x_train = data[data['Landsize'].notnull()].drop(columns='Landsize')
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
# Step 2: Train the machine learning algorithm.
linreg.fit(x_train, y_train)
# Step 3: Predict the missing values in the attribute of the test data.
predicted = linreg.predict(x_test)
# Step 4: Fill the missing Landsize values in the original dataset with the predictions.
# .loc avoids the chained assignment of dataset.Landsize[...] = predicted.
dataset.loc[dataset['Landsize'].isnull(), 'Landsize'] = predicted
dataset.info()

When I try to check the accuracy of the regression:

accuracy = linreg.score(x_test, y_test)
print(accuracy*100,'%')

I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Have you converted "NaN" to a numeric NaN value?

– BarzanHayati, Commented Aug 25, 2019 at 13:34

Souha Gaaloul · Accepted Answer · 2019-08-25 16:13:01Z

2

I think the problem is that you are passing NaN values to the algorithm. Dealing with NaN values is one of the primary steps of data preprocessing. So perhaps you need to convert your NaN values to 0 and then predict for the rows where Landsize = 0 (which is logically the same as having a NaN value, because a land size can't be 0).
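A minimal sketch of that idea, assuming 0 never occurs as a real Landsize. The data below is a synthetic stand-in for real_estate.csv (the values and the relationship between columns are made up for illustration); only the column names come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for real_estate.csv (hypothetical values).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, n),
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(1, 40, n),
})
data["Landsize"] = (100 + 80 * data["Rooms"] + 5 * data["Distance"]
                    + rng.normal(0, 20, n))
# Knock out 20% of the Landsize values to simulate missing data.
data.loc[data.sample(frac=0.2, random_state=0).index, "Landsize"] = np.nan

# Replace NaN with 0 as a sentinel (assumes no real Landsize is 0).
data["Landsize"] = data["Landsize"].fillna(0)
missing = data["Landsize"] == 0

features = ["Price", "Rooms", "Distance"]
linreg = LinearRegression()
# Fit only on rows whose Landsize is known.
linreg.fit(data.loc[~missing, features], data.loc[~missing, "Landsize"])
# Overwrite the sentinel rows with the model's predictions.
data.loc[missing, "Landsize"] = linreg.predict(data.loc[missing, features])
```

The sentinel is optional: selecting the rows with isnull() directly, as the question already does, gives the same mask without touching the column.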

Another thing I think you're doing wrong is:

x_train = data[data['Landsize'].notnull()].drop(columns='Landsize') 
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']

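Your test set contains only the rows where Landsize is missing, so y_test is entirely NaN and cannot be used to score the model. To evaluate the regression, split the rows with a known Landsize into a training and a test set instead: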

from sklearn.model_selection import train_test_split
X = data[data['Landsize'].notnull()].drop(columns='Landsize')
y = data[data['Landsize'].notnull()]['Landsize']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
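Putting the pieces together, the whole workflow might look like the sketch below: evaluate on held-out rows with a known Landsize, then use the fitted model to fill the missing ones. The data is again synthetic (hypothetical values standing in for real_estate.csv). Note that LinearRegression.score returns the R² coefficient of determination, not a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real_estate.csv (hypothetical values).
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, n),
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(1, 40, n),
})
data["Landsize"] = (100 + 80 * data["Rooms"] + 5 * data["Distance"]
                    + rng.normal(0, 20, n))
data.loc[data.sample(frac=0.2, random_state=42).index, "Landsize"] = np.nan

# Rows with a known target are used for training and evaluation;
# rows with a missing target are the ones to impute.
known = data[data["Landsize"].notnull()]
unknown = data[data["Landsize"].isnull()]

X = known.drop(columns="Landsize")
y = known["Landsize"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

# score() returns R^2 on the held-out rows, where the true Landsize is known.
r2 = linreg.score(X_test, y_test)
print(f"R^2 on held-out rows: {r2:.3f}")

# Fill the missing Landsize values with the model's predictions.
data.loc[unknown.index, "Landsize"] = linreg.predict(
    unknown.drop(columns="Landsize")
)
```

Scoring works here because y_test comes from rows with real Landsize values, so there is something to compare the predictions against.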


1 Comment

Souha Gaaloul Over a year ago

You don't have to change the algorithm. Your problem is a regression problem, so a regression algorithm can solve it; you just have to fit your data to the problem ;) 80% of machine learning is data science and getting your data into a suitable format.
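One way to check whether the data is in a suitable format is to count the NaN and infinite values before fitting, since either one in the inputs would raise the ValueError from the question. This is only a sketch on a tiny made-up frame; the column names match the question, the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with a few deliberately bad values.
data = pd.DataFrame({
    "Price": [500000.0, np.nan, 750000.0, 620000.0],
    "Rooms": [2, 3, 4, 3],
    "Distance": [5.0, 12.5, np.inf, 8.0],
    "Landsize": [200.0, np.nan, 450.0, np.nan],
})

# Number of NaN values per column.
nan_counts = data.isna().sum()
print(nan_counts)

# Rows whose feature values are all finite (no NaN, no +/-inf).
features = data[["Price", "Rooms", "Distance"]].to_numpy(dtype=float)
finite_rows = np.isfinite(features).all(axis=1)
print(data[finite_rows])
```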
