
I found an example of linear regression:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(m, c)

My situation is: some elements of y are missing, so x and y are not the same length. Some logic is needed to judge which position is missing and remove it. Is there a ready-made method for this, or should I write it myself?

e.g.:

x = list(range(10))
y = [i * 3 + 5 for i in x]
y.pop(3)  # remove one element to simulate a missing value

I don't know in advance which position is missing. But judging by how the average slope changes, it is probably the fourth position of y that is missing.
This may be a domain-specific question.
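
Here is a minimal sketch of that heuristic (my own illustration, not a ready-made function): compute the slope between consecutive points and flag the first position where it jumps well away from the typical slope. The name find_missing_index and the 0.5 threshold are made up.

def find_missing_index(x, y):
    # successive point-to-point slopes; a missing y shows up as a jump
    slopes = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(y) - 1)]
    typical = sorted(slopes)[len(slopes) // 2]  # median slope
    for i, s in enumerate(slopes):
        if abs(s - typical) > 0.5 * abs(typical):  # arbitrary threshold
            return i + 1  # alignment breaks after index i
    return None

x = list(range(10))
y = [i * 3 + 5 for i in x]
y.pop(3)
print(find_missing_index(x, y))  # -> 3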

1 Comment

How can you determine which position is missing? This problem as posed is ill-constrained. Commented Aug 25, 2012 at 15:12

3 Answers


I'm afraid you're going to run into trouble with your way of creating missing values:

y = [i * 3 + 5 for i in x]
y.pop(3)  # remove one element to simulate a missing value

You specifically remove the element at index 3, but what happens now? How are you supposed to tell your script that it is, in fact, the element at index 3 that is missing?

I would suggest flagging your missing values as np.nan (provided your values are all floats, of course). Then finding which values are missing is easy:

missing = np.isnan(y)

Now you can remove the entries of x and A where y is missing, i.e., where y is np.nan:

y = np.asarray(y, dtype=float)  # boolean indexing requires an array

Anew = A[~missing]   # keep only the rows where y is observed
ynew = y[~missing]

m, c = np.linalg.lstsq(Anew, ynew, rcond=None)[0]
print(m, c)

(the ~ operator turns True into False and vice versa: you're selecting the entries where y is not np.nan)
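
Putting the snippets together, a complete example of this approach (the data is the question's, with the third y value swapped for np.nan):

import numpy as np

x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([-1.0, 0.2, np.nan, 2.1])  # the third value is unknown

A = np.vstack([x, np.ones(len(x))]).T
missing = np.isnan(y)

m, c = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)[0]
print(m, c)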

If your y values are actually integers, that won't work, as np.nan exists only for floats. In that case you can use the np.ma module:

my = np.ma.masked_array(y)
my[3] = np.ma.masked        # flag the entry at index 3 as missing

Anew = A[~my.mask]
ynew = my.compressed()      # the unmasked values only

m, c = np.linalg.lstsq(Anew, ynew, rcond=None)[0]
print(m, c)

1 Comment

The problem is that any position may be missing; I don't know that it is index 3. Index 3 just seems the most likely to be missing. Or, after iterating over all the possibilities, treating 3 as missing gives a complete match.

I am assuming that you know which of the x's are associated with missing elements of y.

In this case, you have a transductive learning problem, because you want to estimate values of y for known positions of x.

In the probabilistic linear regression formulation, learning a distribution p(y|x), it turns out that there is no difference between the transductive solution and the answer you get by just running regression after removing the x's with no associated y's.

So the answer is: just remove the x's with no associated y's and run linear regression on the reduced problem.
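
For concreteness, a sketch of that reduced regression (the data and the choice of flagging the unobserved entry with np.nan are my own assumptions):

import numpy as np

x = np.arange(10, dtype=float)
y = 3 * x + 5
y[3] = np.nan                  # the x with no associated y

keep = ~np.isnan(y)            # drop the x's with no associated y
A = np.vstack([x[keep], np.ones(keep.sum())]).T
m, c = np.linalg.lstsq(A, y[keep], rcond=None)[0]
print(m, c)                    # roughly 3.0 and 5.0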


I have a rough solution below:

def slope(X, Y, i):
    # slope of the line through the first point and point i
    return (Y[i] - Y[0]) / (X[i] - X[0])

len_thold = 0.2  # length-ratio threshold (unused below)

def notgood(lst1, lst2):
    # sanity check: both lists need at least two points (unused below)
    if len(lst1) < 2 or len(lst2) < 2:
        return True
    return False

def adjust_miss(X, Y):
    # pop entries from the longer list until the lengths match,
    # using slope deviations to guess where the gap sits
    slope_thold = 1.1
    if len(X) == len(Y):
        return
    newlen = min(len(X), len(Y))
    if len(Y) - len(X) < 0:
        aim = X
    else:
        aim = Y
    difflen = abs(len(Y) - len(X))
    roughk = slope(X, Y, newlen - 1)   # overall slope estimate
    for i in range(1, newlen):
        if difflen == 0:
            break
        k = slope(X, Y, i)
        if (len(Y) < len(X) and k > slope_thold * roughk) or \
           (len(Y) > len(X) and k < 1.0 / (slope_thold * roughk)):
            aim.pop(i)
            difflen -= 1
    if difflen > 0:
        # no slope jump found: trim the excess from the end
        for i in range(difflen):
            aim.pop(-1)
    assert len(X) == len(Y)

def test_adjust():
    X = list(range(10))
    Y = list(range(10))
    Y.pop(3)
    adjust_miss(X, Y)
    print(X, Y)
