Generating numpy arrays for scikit linear regression model

Question

I have a large dataset with multiple variables: item, location, quality (scale of 1-10), and a range of dates containing "no" if the item did not sell that day and the price if it did sell that day.

I want to create a linear regression model to be able to predict the price given a location and quality. I read through the scikit-learn tutorials, but I'm really confused as to what my input should be for the fit. Can someone help me out?

How is location represented? As a 1-10 "quality of location" figure? As lat/long values?
– David Sanders, Commented Jan 22, 2015 at 23:40

elyase · Accepted Answer · 2015-01-25 22:47:41Z


You need to convert your data to a numeric representation that models can work with. The only problematic feature is the location (a categorical variable), but we can represent it with one column per location holding 0s and 1s (so-called one-hot encoding). An example to get you started:

Preprocessing

from sklearn.feature_extraction import DictVectorizer

data = [
    {'location': 'store 1', 'quality': 8},
    {'location': 'store 1', 'quality': 9},
    {'location': 'store 2', 'quality': 2},
    {'location': 'store 2', 'quality': 3},
]
prices = [100.00, 99.9, 11.25, 9.99]

vec = DictVectorizer()
X = vec.fit_transform(data)  # one-hot encodes 'location', keeps 'quality' numeric
y = prices                   # prices[i] is the target for data[i]

Now X (a sparse matrix by default; X.toarray() gives the dense form) will look like this:

╔══════════════════╦══════════════════╦═════════╗
║ location=store 1 ║ location=store 2 ║ quality ║
╠══════════════════╬══════════════════╬═════════╣
║                1 ║                0 ║       8 ║
║                1 ║                0 ║       9 ║
║                0 ║                1 ║       2 ║
║                0 ║                1 ║       3 ║
╚══════════════════╩══════════════════╩═════════╝
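
If you want to check the encoding yourself, something like the following prints the column names and the dense form of X. This is only a small sketch on the same toy data; it relies on the fitted `feature_names_` attribute for the column order and on `toarray()` to densify the sparse result.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'location': 'store 1', 'quality': 8},
    {'location': 'store 1', 'quality': 9},
    {'location': 'store 2', 'quality': 2},
    {'location': 'store 2', 'quality': 3},
]

vec = DictVectorizer()
X = vec.fit_transform(data)    # SciPy sparse matrix by default

columns = vec.feature_names_   # column order of X
dense = X.toarray()            # dense NumPy array, for inspection only

print(columns)
print(dense)
```

Densifying is fine for a quick look at a small sample, but for a large dataset you would normally keep X sparse and pass it to the model as is.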

Model training

This matrix can now be fed to a model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

Prediction

The new data will also need to be converted into numeric form using the same DictVectorizer. Note that now we use .transform instead of .fit_transform:

>>> test_data = [{'location': 'store 2', 'quality': 3}]
>>> X_test = vec.transform(test_data)
>>> model.predict(X_test)
array([ 10.28])
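
One thing to watch at prediction time: as far as I know, DictVectorizer silently ignores feature values it did not see during fit, so an unknown location ends up with zeros in every location column. A small sketch of that behaviour (the 'store 3' location is made up for illustration, and `sparse=False` is used only to make the output readable):

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {'location': 'store 1', 'quality': 8},
    {'location': 'store 2', 'quality': 2},
]
vec = DictVectorizer(sparse=False)   # dense output, easier to read
vec.fit(train)

# 'store 3' is a hypothetical location that was not in the training data
new_rows = [
    {'location': 'store 2', 'quality': 3},
    {'location': 'store 3', 'quality': 5},
]
X_new = vec.transform(new_rows)

print(vec.feature_names_)
print(X_new)
```

The model will still return a number for the second row, so it may be worth checking for unknown locations before trusting such a prediction.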

By the way, I would approach this as a classification problem (sold/not sold) first, and then use regression to determine the price only for the sold items.
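
A rough sketch of that two-step idea, assuming each record is one item on one day and the raw value is either the string "no" or a price. The record layout, the toy values, and the choice of LogisticRegression as the classifier are my assumptions, not something taken from the question:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical records: (features, raw value for that day)
records = [
    ({'location': 'store 1', 'quality': 8}, 100.00),
    ({'location': 'store 1', 'quality': 9}, 99.9),
    ({'location': 'store 1', 'quality': 2}, 'no'),
    ({'location': 'store 2', 'quality': 2}, 11.25),
    ({'location': 'store 2', 'quality': 3}, 9.99),
    ({'location': 'store 2', 'quality': 1}, 'no'),
]

features = [f for f, _ in records]
sold = [value != 'no' for _, value in records]   # classification target

vec = DictVectorizer()
X_all = vec.fit_transform(features)

# Step 1: sold / not sold classifier, trained on every record
clf = LogisticRegression()
clf.fit(X_all, sold)

# Step 2: price regression, trained only on the records that sold
sold_features = [f for f, value in records if value != 'no']
sold_prices = [value for _, value in records if value != 'no']
reg = LinearRegression()
reg.fit(vec.transform(sold_features), sold_prices)

query = vec.transform([{'location': 'store 2', 'quality': 3}])
print(clf.predict(query), reg.predict(query))
```

Both models share one fitted DictVectorizer, so the columns line up between the classifier, the regressor, and any new query.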

edited Jan 25, 2015 at 22:47

answered Jan 23, 2015 at 0:06

elyase


2 Comments

ukejoe Over a year ago

What is "y" in the model training, and how can I associate "prices" with a particular location/quality?

elyase Over a year ago

y = prices, and the association is by index: the first item in the data list corresponds to the first item in the prices list, data[2] to prices[2], and so on. BTW, if the location is a zip code I would convert it to a string so that DictVectorizer handles it as a categorical and not a numerical variable.
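
To illustrate the zip code point, here is a small sketch comparing the two cases. The zip codes are arbitrary example values, and `sparse=False` is only there to make the arrays easy to print:

```python
from sklearn.feature_extraction import DictVectorizer

# Zip code as a number: treated as one ordinary numeric column
numeric = DictVectorizer(sparse=False)
X_num = numeric.fit_transform([{'zip': 90210}, {'zip': 10001}])
print(numeric.feature_names_)
print(X_num)

# Zip code as a string: one 0/1 column per distinct zip code
categorical = DictVectorizer(sparse=False)
X_cat = categorical.fit_transform([{'zip': str(90210)}, {'zip': str(10001)}])
print(categorical.feature_names_)
print(X_cat)
```

In the numeric case a linear model would treat a larger zip code as "more" of something, which is rarely what you want.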
