
I am building 1500 different models to predict 1500 different y values using the same 1500 predictors, Xs, in a linear model. I have 15 data points for each. I have the Ys in one array and the Xs in another.

import numpy as np

Ys = np.random.rand(15, 1500)
Xs = np.random.rand(15, 1500)

I can loop through the columns of Ys and fit my model and get the coefficients for all the Xs.

>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()

>>> def f(Ys, Xs):
...     for i in range(Ys.shape[1]):
...         clf.fit(Xs, Ys[:, i])
...         print(clf.coef_)

>>> f(Ys, Xs)
[ 0.00415945  0.00518805  0.00200809 ..., -0.00293134  0.00405276
 -0.00082493]
[-0.00278009 -0.00926449  0.00849694 ..., -0.00183793  0.00493365
 -0.00053502]
[-0.004892   -0.00067937  0.00490643 ...,  0.00074988  0.00166438
  0.00197527]...

This works well enough, but looping through the columns of Ys seems like an inefficient way to deal with these arrays, especially once I introduce cross-validation into the picture.

Is there some sort of apply equivalent (like in pandas) that would make this more efficient?

2 Answers


A couple of thoughts:

(1) Given that each linear model has more predictors (1500) than data points (15), your models will be badly overfit to the training data and will have essentially no predictive power on new data. Consider using ridge regression instead (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).
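As a quick sketch (alpha=1.0 here is an arbitrary placeholder you would tune with cross-validation), Ridge accepts a two-dimensional Ys directly, so all 1500 fits happen in one call:

>>> from sklearn.linear_model import Ridge
>>> clf = Ridge(alpha=1.0)           # alpha=1.0 is an arbitrary placeholder
>>> clf.fit(Xs, Ys).coef_.shape      # one row of coefficients per target
(1500, 1500)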

(2) If you are using the same set of predictors repeatedly in a series of linear models, you can take advantage of the fact that the solution to a linear regression is coef = inv(Xs' * Xs) * Xs' * y. Notice that inv(Xs' * Xs) * Xs' is the same for every one of your linear models, so you can compute all of them simultaneously as inv(Xs' * Xs) * Xs' * Ys. One caveat: with more predictors than data points, Xs' * Xs is singular, so for plain least squares you need the pseudoinverse (e.g. np.linalg.pinv) rather than a true inverse. If you wind up using ridge regression, modify this formula slightly to inv(Xs' * Xs + alpha * I) * Xs' * Ys, where I is a 1500 by 1500 identity matrix (one row/column per predictor); that matrix is invertible for any alpha > 0.
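A rough NumPy sketch of that batched computation (alpha = 1.0 is again an arbitrary placeholder; each column coefs[:, i] holds the coefficients for Ys[:, i], i.e. the transpose of sklearn's coef_ layout):

import numpy as np

# All 1500 least-squares fits at once. With 1500 predictors and only
# 15 data points, Xs' * Xs is singular, so use the pseudoinverse.
coefs_ols = np.linalg.pinv(Xs) @ Ys                    # shape (1500, 1500)

# Ridge version: Xs' * Xs + alpha * I is invertible for alpha > 0,
# so an ordinary solve works.
alpha = 1.0                                            # arbitrary placeholder
I = np.eye(Xs.shape[1])                                # 1500 x 1500 identity
coefs_ridge = np.linalg.solve(Xs.T @ Xs + alpha * I, Xs.T @ Ys)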


1 Comment

Like magic! Regarding (1), yup, I'm still playing around with the parameter search and cross-validation, so I figured I'd start with the simplest-but-incorrect approach. With the ridge modification to the formula, is the identity matrix supposed to be 15x15 or 1500x1500?

The linear regression estimator supports multi-target regression out of the box, so you can simply do:

>>> import numpy as np
>>> Ys = np.random.rand(15,1500)
>>> Xs = np.random.rand(15,1500)
>>> from sklearn.linear_model import LinearRegression
>>> clf = LinearRegression().fit(Xs, Ys)

The coefficients are stored in the coef_ attribute of shape (n_targets, n_features):

>>> clf.coef_
array([[  5.55249034e-03,   4.80064644e-03,  -9.84935468e-03, ...,
         -4.56988996e-03,   1.13633031e-03,   1.76111517e-03],
       [ -3.92718305e-03,  -3.97534623e-03,   6.19243263e-03, ...,
         -1.87971624e-03,  -1.45732814e-03,   1.51018259e-03],
       [ -6.79887329e-04,  -4.80656996e-04,   1.74724622e-03, ...,
         -3.42881741e-04,  -3.48451425e-03,  -3.85790348e-04],
       ...,
       [ -1.73318217e-03,  -8.70409477e-03,  -9.64475499e-05, ...,
         -4.52182601e-03,   3.49238171e-03,  -1.50492517e-03],
       [  2.77132135e-05,  -7.12606751e-04,   4.32136642e-03, ...,
          3.34105396e-03,   1.98439783e-03,  -1.04102019e-03],
       [  1.93154992e-03,   2.45374075e-03,  -1.17614144e-03, ...,
         -2.33196606e-03,   1.60940753e-03,   2.04974586e-03]])

1 Comment

This is perfect too -- especially for more complex types of regressions. Seeing how the math works is useful, but it starts to get a bit beyond me for things like lasso and elastic net.
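For what it's worth, the same one-call pattern should carry over: scikit-learn's Lasso and ElasticNet also accept a two-dimensional Ys (each target is fit independently; the MultiTaskLasso / MultiTaskElasticNet variants are the ones that couple the targets), so you don't need the closed-form math at all. A sketch with an arbitrary alpha:

>>> from sklearn.linear_model import Lasso
>>> clf = Lasso(alpha=0.01)          # alpha=0.01 is an arbitrary placeholder
>>> clf.fit(Xs, Ys).coef_.shape      # one row of coefficients per target
(1500, 1500)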
