I am trying to handle missing values in one of the columns using linear regression.
The column is named "Landsize", and I am trying to predict its NaN values from several other variables.
Here is the linear regression code:
import pandas as pd
from sklearn.linear_model import LinearRegression
# Importing the dataset
dataset = pd.read_csv('real_estate.csv')
linreg = LinearRegression()
# Keep only the predictor columns and the target column
data = dataset[['Price','Rooms','Distance','Landsize']]
#Step-1: Split the data: rows where Landsize is present form the training set, and rows where it is missing form the test set.
x_train = data[data['Landsize'].notnull()].drop(columns='Landsize')
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
#Step-2: Train the machine learning algorithm
linreg.fit(x_train, y_train)
#Step-3: Predict the missing values in the attribute of the test data.
predicted = linreg.predict(x_test)
#Step-4: Let’s obtain the complete dataset by combining with the target attribute.
# Assign through .loc; chained indexing may write to a temporary copy instead of dataset
dataset.loc[dataset['Landsize'].isnull(), 'Landsize'] = predicted
dataset.info()
When I try to check the regression result with the following code:
accuracy = linreg.score(x_test, y_test)
print(accuracy*100,'%')
I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
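For context, here is a minimal sketch of why the check above fails and one possible way around it. y_test is built from the rows where Landsize is null, so every value in it is NaN, and score needs real target values to compare against. One option is to hold out part of the rows with a known Landsize and score on those instead. Note that score on a LinearRegression returns R^2 (the coefficient of determination), not a classification accuracy. The synthetic data, the 80/20 split, and the names known, unknown, X_val, and y_val are assumptions for illustration; they are not taken from real_estate.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real_estate.csv (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    'Price': rng.uniform(3e5, 2e6, n),
    'Rooms': rng.integers(1, 6, n).astype(float),
    'Distance': rng.uniform(1, 40, n),
})
data['Landsize'] = (100 + 0.0002 * data['Price'] + 50 * data['Rooms']
                    + rng.normal(0, 20, n))

# Blank out some Landsize values to mimic the missing entries
missing_idx = rng.choice(n, size=40, replace=False)
data.loc[missing_idx, 'Landsize'] = np.nan

known = data[data['Landsize'].notnull()]
unknown = data[data['Landsize'].isnull()]

# The target of the rows to impute is entirely NaN, so it cannot be scored against
all_nan = unknown['Landsize'].isnull().all()

# Hold out part of the rows with a known Landsize for evaluation
X_fit, X_val, y_fit, y_val = train_test_split(
    known.drop(columns='Landsize'), known['Landsize'],
    test_size=0.2, random_state=0)

linreg = LinearRegression().fit(X_fit, y_fit)
r2 = linreg.score(X_val, y_val)  # R^2 on held-out rows with known Landsize
print('R^2 on held-out rows:', r2)

# Impute the missing values with the fitted model
data.loc[data['Landsize'].isnull(), 'Landsize'] = linreg.predict(
    unknown.drop(columns='Landsize'))
```

With this layout the model is evaluated only on rows whose true Landsize is known, and the rows with missing values are used purely for prediction.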