
I found an example of linear regression:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(m, c)

My situation is: some elements of y are missing, so x and y are not the same length. Some logic is needed to judge which position is missing and remove it. Is there a ready-made method for this, or should I write it myself?

e.g.:

x = list(range(10))
y = [i * 3 + 5 for i in x]
y.pop(3)  # remove one element to simulate a missing value

I don't know in advance which position is missing. But judging by how the average slope changes, it is probably the fourth position of y that is missing.
This may be a domain-specific question.
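
Here is a minimal sketch of that heuristic (my own illustration, not a ready-made function): compute the slope between consecutive points and flag the first position where it jumps well away from the typical slope. The name find_missing_index and the 0.5 threshold are made up.

def find_missing_index(x, y):
    # successive point-to-point slopes; a missing y shows up as a jump
    slopes = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(y) - 1)]
    typical = sorted(slopes)[len(slopes) // 2]  # median slope
    for i, s in enumerate(slopes):
        if abs(s - typical) > 0.5 * abs(typical):  # arbitrary threshold
            return i + 1  # alignment breaks after index i
    return None

x = list(range(10))
y = [i * 3 + 5 for i in x]
y.pop(3)
print(find_missing_index(x, y))  # -> 3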

1 Comment

How can you determine which position is missing? This problem as posed is ill-constrained. Commented Aug 25, 2012 at 15:12

3 Answers


I'm afraid you're going to run into trouble with your way of creating missing values:

y = [i * 3 + 5 for i in x]
y.pop(3)  # remove one element to simulate a missing value

You specifically remove the element at index 3, but what happens now? How are you supposed to tell your script that it is, in fact, the element at index 3 that is missing?

I would suggest flagging your missing values as np.nan (provided your values are all floats, of course). Then finding which values are missing is easy:

missing = np.isnan(y)

Now you can remove the entries of x and A where y is missing, i.e., where y is np.nan:

y = np.asarray(y, dtype=float)  # boolean indexing requires an array

Anew = A[~missing]   # keep only the rows where y is observed
ynew = y[~missing]

m, c = np.linalg.lstsq(Anew, ynew, rcond=None)[0]
print(m, c)

(the ~ operator turns True into False and vice versa: you're selecting the entries where y is not np.nan)
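
Putting the snippets together, a complete example of this approach (the data is the question's, with the third y value swapped for np.nan):

import numpy as np

x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([-1.0, 0.2, np.nan, 2.1])  # the third value is unknown

A = np.vstack([x, np.ones(len(x))]).T
missing = np.isnan(y)

m, c = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)[0]
print(m, c)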

If your y values are actually integers, that won't work, as np.nan exists only for floats. In that case you can use the np.ma module:

my = np.ma.masked_array(y)
my[3] = np.ma.masked        # flag the entry at index 3 as missing

Anew = A[~my.mask]
ynew = my.compressed()      # the unmasked values only

m, c = np.linalg.lstsq(Anew, ynew, rcond=None)[0]
print(m, c)

1 Comment

The problem is that any position may be missing; I don't know that it is index 3. Index 3 just seems the most likely to be missing. Or, after iterating over all the possibilities, treating 3 as missing gives a complete match.

I am assuming that you know which of the x's are associated with missing elements of y.

In this case, you have a transductive learning problem, because you want to estimate values of y for known positions of x.

In the probabilistic linear regression formulation, learning a distribution p(y|x), it turns out that there is no difference between the transductive solution and the answer you get by just running regression after removing the x's with no associated y's.

So the answer is: just remove the x's with no associated y's and run linear regression on the reduced problem.
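
For concreteness, a sketch of that reduced regression (the data and the choice of flagging the unobserved entry with np.nan are my own assumptions):

import numpy as np

x = np.arange(10, dtype=float)
y = 3 * x + 5
y[3] = np.nan                  # the x with no associated y

keep = ~np.isnan(y)            # drop the x's with no associated y
A = np.vstack([x[keep], np.ones(keep.sum())]).T
m, c = np.linalg.lstsq(A, y[keep], rcond=None)[0]
print(m, c)                    # roughly 3.0 and 5.0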


I have a rough solution below:

def slope(X, Y, i):
    # slope of the line through the first point and point i
    return (Y[i] - Y[0]) / (X[i] - X[0])

len_thold = 0.2  # length-ratio threshold (unused below)

def notgood(lst1, lst2):
    # sanity check: both lists need at least two points (unused below)
    if len(lst1) < 2 or len(lst2) < 2:
        return True
    return False

def adjust_miss(X, Y):
    # pop entries from the longer list until the lengths match,
    # using slope deviations to guess where the gap sits
    slope_thold = 1.1
    if len(X) == len(Y):
        return
    newlen = min(len(X), len(Y))
    if len(Y) - len(X) < 0:
        aim = X
    else:
        aim = Y
    difflen = abs(len(Y) - len(X))
    roughk = slope(X, Y, newlen - 1)   # overall slope estimate
    for i in range(1, newlen):
        if difflen == 0:
            break
        k = slope(X, Y, i)
        if (len(Y) < len(X) and k > slope_thold * roughk) or \
           (len(Y) > len(X) and k < 1.0 / (slope_thold * roughk)):
            aim.pop(i)
            difflen -= 1
    if difflen > 0:
        # no slope jump found: trim the excess from the end
        for i in range(difflen):
            aim.pop(-1)
    assert len(X) == len(Y)

def test_adjust():
    X = list(range(10))
    Y = list(range(10))
    Y.pop(3)
    adjust_miss(X, Y)
    print(X, Y)
