I am trying to handle missing values in one of my columns with linear regression.

The name of the column is "Landsize", and I am trying to predict its NaN values with linear regression using several other variables.

Here is the linear regression code:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Importing the dataset
dataset = pd.read_csv('real_estate.csv')

linreg = LinearRegression()
data = dataset[['Price','Rooms','Distance','Landsize']]
# Step 1: Split the data. Rows with a known Landsize are the training set; rows with a missing Landsize are the test set.
x_train = data[data['Landsize'].notnull()].drop(columns='Landsize')
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
# Step 2: Train the machine learning algorithm.
linreg.fit(x_train, y_train)
# Step 3: Predict the missing values in the attribute of the test data.
predicted = linreg.predict(x_test)
# Step 4: Fill the missing values to obtain the complete dataset.
# .loc avoids the chained assignment used originally, which may not write back to dataset.
dataset.loc[dataset['Landsize'].isnull(), 'Landsize'] = predicted
dataset.info()

When I try to check the regression result with the accuracy code below, I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Accuracy:

accuracy = linreg.score(x_test, y_test)
print(accuracy*100,'%')
  • Have you converted "NaN" to a numeric NaN value? Commented Aug 25, 2019 at 13:34

1 Answer

I think the problem is that you are passing NaN values to the algorithm. Dealing with NaN values is one of the first steps of data preprocessing. In your code, y_test is built from exactly the rows where Landsize is null, so every value in it is NaN, and linreg.score(x_test, y_test) is the likely source of the ValueError. One option is to convert your NaN values to 0 and predict for the rows where Landsize = 0 (logically the same as a NaN value, because a land size can't be 0).
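To see how NaN values reach the scoring step, here is a minimal sketch on made-up toy data (the numbers are hypothetical and only stand in for real_estate.csv). Selecting the rows where Landsize is null and then taking Landsize as the target gives a target that is entirely NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for real_estate.csv
data = pd.DataFrame({
    "Price": [500.0, 650.0, 700.0, 820.0, 900.0],
    "Rooms": [2, 3, 3, 4, 4],
    "Distance": [5.0, 7.5, 3.2, 10.1, 8.4],
    "Landsize": [200.0, np.nan, 350.0, np.nan, 600.0],
})

# Same selection as in the question: rows whose Landsize is missing
missing = data["Landsize"].isnull()
y_test = data.loc[missing, "Landsize"]

# Every target in y_test is NaN by construction, so there is
# nothing to compare the predictions against
all_nan = bool(y_test.isnull().all())
print(len(y_test), all_nan)
```

Any metric that compares predictions with y_test will therefore receive NaN as its ground truth.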

Another thing I think you're doing wrong is:

x_train = data[data['Landsize'].notnull()].drop(columns='Landsize') 
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']

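Your test set contains only the rows where Landsize is unknown, so there are no true values to score the predictions against. You should maybe split the rows with a known Landsize into training and test sets instead: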

from sklearn.model_selection import train_test_split
X = data[data['Landsize'].notnull()].drop(columns='Landsize')
y = data[data['Landsize'].notnull()]['Landsize']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
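Putting the pieces together, the whole workflow might look roughly like the sketch below. It is not tested against your file: the data is synthetic (generated with an assumed linear relationship purely for illustration), and the names features, known, r2 and remaining are my own. The idea is to evaluate on held-out rows whose Landsize is known, and only then fill in the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real_estate.csv (assumption: illustrative values only)
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "Price": rng.uniform(300, 1500, n),
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(1, 30, n),
})
dataset["Landsize"] = (0.4 * dataset["Price"] + 50 * dataset["Rooms"]
                       + rng.normal(0, 20, n))
# Blank out 20% of the Landsize values to imitate the missing data
dataset.loc[dataset.sample(frac=0.2, random_state=0).index, "Landsize"] = np.nan

features = ["Price", "Rooms", "Distance"]
known = dataset["Landsize"].notnull()

# Split only the rows with a known Landsize, so y_test has real values
X_train, X_test, y_train, y_test = train_test_split(
    dataset.loc[known, features], dataset.loc[known, "Landsize"],
    test_size=0.33, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
r2 = linreg.score(X_test, y_test)  # R^2 on held-out rows, not a classification accuracy
print("R^2 on held-out rows:", r2)

# Impute the missing rows; .loc writes directly into dataset
dataset.loc[~known, "Landsize"] = linreg.predict(dataset.loc[~known, features])
remaining = int(dataset["Landsize"].isnull().sum())
print("Missing Landsize values left:", remaining)
```

Note that LinearRegression.score returns the coefficient of determination (R^2), so multiplying it by 100 and printing it as a percentage "accuracy" can be misleading. If Price, Rooms or Distance also contain NaN values, those rows would need handling before fit and predict.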
1 Comment

You don't have to change the algorithm. Your problem is a regression problem, so a regression algorithm can solve it; you just have to fit your data to the problem ;) 80% of machine learning is data science and getting your data into a suitable format.