I have a large dataset with multiple variables: item, location, quality (scale of 1-10), and a range of dates containing "no" if the item did not sell that day and the price if it did sell that day.

I want to create a linear regression model to be able to predict the price given a location and quality. I read through the scikit-learn tutorials, but I'm really confused as to what my input should be for the fit. Can someone help me out?

  • How is location represented? As a 1-10 "quality of location" figure? As lat/long values? Commented Jan 22, 2015 at 23:40
  • Location is represented by a zip code Commented Jan 25, 2015 at 21:17

1 Answer

You need to convert your data to a numeric representation that models can work with. The only problematic feature is the location (a categorical variable), but it can be represented with one column per location, filled with 0s and 1s (so-called one-hot encoding). An example to get you started:

Preprocessing

from sklearn.feature_extraction import DictVectorizer

# One dict per observation; string values are treated as categorical
data = [
        {'location': 'store 1', 'quality': 8},
        {'location': 'store 1', 'quality': 9},
        {'location': 'store 2', 'quality': 2},
        {'location': 'store 2', 'quality': 3},
        ]
# prices[i] is the sale price for data[i]
prices = [100.00, 99.9, 11.25, 9.99]
vec = DictVectorizer()
X = vec.fit_transform(data)  # sparse matrix with one-hot location columns
y = prices

Now X (a sparse matrix, shown here in dense form as returned by X.toarray()) will look like this:

╔══════════════════╦══════════════════╦═════════╗
║ location=store 1 ║ location=store 2 ║ quality ║
╠══════════════════╬══════════════════╬═════════╣
║                1 ║                0 ║       8 ║
║                1 ║                0 ║       9 ║
║                0 ║                1 ║       2 ║
║                0 ║                1 ║       3 ║
╚══════════════════╩══════════════════╩═════════╝
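If you want to check the encoding yourself, something like the following should work. It is only a small sketch: it rebuilds the same toy data, reads the column order from the vectorizer's feature_names_ attribute, and converts the sparse result to a dense array for printing. The variable names feature_names and dense are mine.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'location': 'store 1', 'quality': 8},
    {'location': 'store 1', 'quality': 9},
    {'location': 'store 2', 'quality': 2},
    {'location': 'store 2', 'quality': 3},
]

vec = DictVectorizer()
X = vec.fit_transform(data)          # SciPy sparse matrix by default

feature_names = vec.feature_names_   # column order of X
dense = X.toarray()                  # dense NumPy array, for display only

print(feature_names)
print(dense)
```

For a large dataset with many zip codes you would normally keep X sparse and only densify a few rows for inspection.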

Model training

This matrix can now be fed to a model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

Prediction

The new data also needs to be converted into numeric form using the same DictVectorizer. Note that we now use .transform instead of .fit_transform, so the columns match the ones learned from the training data:

>>> test_data = [{'location': 'store 2', 'quality': 3}]
>>> X_test = vec.transform(test_data)
>>> model.predict(X_test)
array([ 10.28])
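One way to make sure the same vectorizer is always applied before the model is to chain the two in a scikit-learn Pipeline. This is an optional variation on the steps above, not something the original code requires; the names pipe, prediction and unseen are mine, and 'store 3' is a made-up location used only to show what happens with a value that was not in the training data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

data = [
    {'location': 'store 1', 'quality': 8},
    {'location': 'store 1', 'quality': 9},
    {'location': 'store 2', 'quality': 2},
    {'location': 'store 2', 'quality': 3},
]
prices = [100.00, 99.9, 11.25, 9.99]

# The pipeline fits the vectorizer and the regression together,
# and reuses the fitted vectorizer on every later predict call.
pipe = make_pipeline(DictVectorizer(), LinearRegression())
pipe.fit(data, prices)

prediction = pipe.predict([{'location': 'store 2', 'quality': 3}])
print(prediction)

# A location that was never seen during fit is silently ignored:
# both location columns stay 0 and only quality is filled in.
fitted_vec = pipe.named_steps['dictvectorizer']
unseen = fitted_vec.transform([{'location': 'store 3', 'quality': 5}]).toarray()
print(unseen)
```

Because unseen locations are dropped without an error, predictions for zip codes that never appeared in the training data should be treated with caution.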

By the way, I would approach this as a classification problem first (sold / not sold), and then use regression to determine the price only for the items that sold.
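A rough sketch of that two-stage idea, under some assumptions of mine: the wide table has already been reshaped to one row per item per day, a day marked "no" is stored as a price of None, and LogisticRegression is used as the classifier (any classifier would do). The rows below are invented for illustration only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical rows: (features, price), with price None when the item did not sell
rows = [
    ({'location': 'store 1', 'quality': 8}, 100.00),
    ({'location': 'store 1', 'quality': 9}, 99.90),
    ({'location': 'store 1', 'quality': 2}, None),
    ({'location': 'store 2', 'quality': 2}, 11.25),
    ({'location': 'store 2', 'quality': 3}, 9.99),
    ({'location': 'store 2', 'quality': 1}, None),
]
features = [f for f, _ in rows]
sold = [price is not None for _, price in rows]

# Stage 1: classify sold / not sold using every row
clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(features, sold)

# Stage 2: regress price using only the rows that actually sold
sold_features = [f for f, price in rows if price is not None]
sold_prices = [price for _, price in rows if price is not None]
reg = make_pipeline(DictVectorizer(), LinearRegression())
reg.fit(sold_features, sold_prices)

new_item = [{'location': 'store 2', 'quality': 3}]
p_sold = clf.predict_proba(new_item)[0, 1]   # probability of the True (sold) class
price_if_sold = reg.predict(new_item)[0]     # price estimate, conditional on selling
print(p_sold, price_if_sold)
```

Keeping the unsold days out of the regression avoids having to invent a numeric price for "no".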

2 Comments

What is "y" in the model training, and how can I associate "prices" with a particular location/quality?
y = prices, and the association is by index, i.e. the first item in the data list corresponds to the first item in the prices list, data[2] -> prices[2], and so on. BTW, if the location is a zip code I would convert it to a string, so that the DictVectorizer handles it as a categorical and not a numerical variable.
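To illustrate the zip code point, here is a small comparison. The zip codes are arbitrary examples and the variable names are mine. With integer values DictVectorizer keeps a single numeric location column, while with string values it creates one 0/1 column per zip code.

```python
from sklearn.feature_extraction import DictVectorizer

# Arbitrary example zip codes, stored as integers
numeric_zip = [
    {'location': 90210, 'quality': 8},
    {'location': 10001, 'quality': 3},
]
# Same records with the zip code converted to a string
string_zip = [
    {'location': str(row['location']), 'quality': row['quality']}
    for row in numeric_zip
]

vec_num = DictVectorizer().fit(numeric_zip)
vec_str = DictVectorizer().fit(string_zip)

print(vec_num.feature_names_)  # a single numeric 'location' column
print(vec_str.feature_names_)  # one column per zip code
```

Treating the zip code as a number would make the regression assume that price changes linearly with the numeric value of the code, which is rarely what you want.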
