
Note: this is not a question about multiple regression; it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).

I have two m x n arrays x and y. The rows correspond to each other, and each pair of rows is the set of (x, y) points for one measurement. That is, plt.plot(x.T, y.T, '.') would plot each of the m datasets/measurements.

I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress(). (Assume I don't want solutions based on doing linear algebra with the matrices but instead want to work with this function, or an equivalent black-box function.) I could try np.vectorize, but the docs indicate it also loops.
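
For reference, here is a minimal sketch of what that np.vectorize route could look like (it assumes a NumPy recent enough to support the signature argument, and per the docs it still loops over the rows in Python internally, so it is a convenience rather than a speedup):

import numpy as np
import scipy.stats as stats

# gufunc-style signature: two length-n rows in, five scalars out per row
vec_linregress = np.vectorize(stats.linregress,
                              signature='(n),(n)->(),(),(),(),()')
slopes, inters, rvals, pvals, stderrs = vec_linregress(x, y)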

With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, `%%timeit` returns, using a small dataset (commented out in the code below):

(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop

To magnify the difference, I made a much bigger random dataset (dimensions trials x trials):

(loop, trials = 1000)  1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000)   1 loops, best of 3: 256 ms per loop
(map, trials = 10000)  1 loops, best of 3: 2.37 s per loop

That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = y.shape[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
# Method 1: explicit loop over the rows, one linregress call per row
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold(True)
#plt.plot(x.T, x.T*slope + inter)
# Method 2: can the loop be removed? Use map() over per-row lists
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold(True)
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
  • If you have a multi-core machine, an easy thing to do is to use IPython's parallel map(). Check out ipython.org/ipython-doc/dev/parallel/… (a standard-library sketch of the same idea follows these comments). Commented Mar 25, 2014 at 21:49
  • Thanks for the pointer, didn't realize that existed. Unfortunately, I'm pulling into IPython just to isolate this bit of code. And in my application, the data sets are small enough that any of these techniques are fine, I just want to figure out the best way of doing this for when I inevitably hit a dataset where performance matters. Commented Mar 27, 2014 at 14:51
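
For completeness, a rough standard-library sketch of that parallel-map idea (the helper name _regress_pair is hypothetical; it must live at module level so it can be pickled, and on Windows the Pool creation should be guarded with if __name__ == '__main__'):

import multiprocessing
import numpy as np
import scipy.stats as stats

def _regress_pair(args):
    # unpack one (x-row, y-row) pair and regress it
    return stats.linregress(args[0], args[1])

pool = multiprocessing.Pool()  # one worker per core by default
results = np.array(pool.map(_regress_pair, zip(x, y)))
pool.close()
pool.join()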

1 Answer


Single-variable linear regression is simple enough to vectorize manually: for each row, the slope is sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2) and the intercept is mean(y) - slope * mean(x), and np.einsum computes both row-wise sums in one shot:

def multiple_linregress(x, y):
    # row-wise means, kept 2-D (keepdims) so broadcasting works below
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean

    # einsum('ij,ij->i', a, b) is a row-wise sum of products, so this is
    # sum(x_norm * y_norm) / sum(x_norm**2) for each row
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]

    return np.column_stack((slope, intercept))

With some made up data:

m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)

it outperforms your looping options by a fair margin:

%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
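
As a quick sanity check (a sketch reusing the allclose comparison from the question), the vectorized output should agree with a plain loop over stats.linregress on the same data:

import scipy.stats as stats

# compare slopes/intercepts against per-row linregress calls
check = np.array([stats.linregress(xr, yr)[:2] for xr, yr in zip(x, y)])
print "Matches linregress?:", np.allclose(multiple_linregress(x, y), check)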

3 Comments

Thanks, and true, it's fairly trivial to vectorize manually - and I think an efficient solution can be found using numpy.linalg.lstsq if I want to spend some time playing with the linear algebra (a sketch follows these comments). My hope was to find a Pythonic way to vectorize functions - I thought map() would do the trick, but based on performance it seems to be implemented via looping as well.
Looping is not the only thing that is expensive in Python. If you cythonized the loop but kept a Python function call at each iteration, I don't think you would see a big improvement.
Belated acceptance, thanks for giving performance data with your answer.
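
For the record, a sketch of the numpy.linalg.lstsq idea from the first comment. It collapses to a single call only when every row of x is identical (as with the np.tile setup in the question), since all m regressions then share one design matrix:

import numpy as np

# shared design matrix [x, 1]; y.T supplies one right-hand side per
# measurement, so a single lstsq call fits all m regressions at once
A = np.column_stack((x[0], np.ones(x.shape[1])))
coef, residuals, rank, sv = np.linalg.lstsq(A, y.T)
slopes, intercepts = coef  # coef has shape (2, m)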
