
I have three arrays u, v, a, e.g.

u = np.array([1.0,2.0,2.0,3.0,4.0])
v = np.array([10.0,21.0,18.0,30.0,40.0])
a = np.array([100.0,210.0,220.0,300.0,400.0])

If two elements in u are the same, delete the one with the higher v value (and its corresponding a element). For the above example, the result should be

u_new = np.array([1.0,2.0,3.0,4.0])
v_new = np.array([10.0,18.0,30.0,40.0])
a_new = np.array([100.0,220.0,300.0,400.0])

def remove_duplicates(u, v, a):
    # np.unique returns the sorted unique values and the index of the first
    # occurrence of each; this assumes u is already sorted, so the duplicates
    # of each value sit in one contiguous block.
    u_new, indices = np.unique(u, return_index=True)
    v_new = np.zeros(len(u_new), dtype=np.float64)
    a_new = np.zeros(len(u_new), dtype=np.float64)
    for i in range(len(indices)):
        j1 = indices[i]
        # End of the current block: start of the next block, or the end of
        # the array for the last block (j1 + 1 would miss trailing duplicates).
        if i < len(indices) - 1:
            j2 = indices[i + 1]
        else:
            j2 = len(u)
        # Keep the smallest v in the block and its matching a.
        v_new[i] = np.amin(v[j1:j2])
        k = np.argmin(v[j1:j2]) + j1
        a_new[i] = a[k]

    return u_new, v_new, a_new

The above code has a problem with floating-point numbers, because two floats are rarely exactly equal. So I had to change it to a very 'stupid' way:

def remove_duplicates(u, v, a):
    # Copy the inputs so they are not modified in place.
    u_new = u.copy()
    v_new = v.copy()
    a_new = a.copy()
    cnt = 0
    for i in range(len(u)):
        if cnt < 1:
            # Always keep the first element.
            u_new[cnt] = u[i]
            v_new[cnt] = v[i]
            a_new[cnt] = a[i]
            cnt += 1
        else:
            if abs(u[i] - u_new[cnt - 1]) > 1e-5:
                # New distinct u value: append it.
                u_new[cnt] = u[i]
                v_new[cnt] = v[i]
                a_new[cnt] = a[i]
                cnt += 1
            else:
                print("Two points with the same x coordinate found, ignoring index", i)
                # Duplicate u value: keep the entry with the smaller v.
                if v_new[cnt - 1] > v[i]:
                    v_new[cnt - 1] = v[i]
                    a_new[cnt - 1] = a[i]

    return u_new[:cnt], v_new[:cnt], a_new[:cnt]

How can I program it in a Pythonic way?

  • Constructing a new array by looping over the first two arrays seems most feasible to me. I think no in-place operation is preferable. Commented Dec 6, 2016 at 2:08
  • Thank you for your comment. I want more Python-like code to do this, as I think looping over the array is time-consuming. Commented Dec 6, 2016 at 4:59
  • Are the arrays always 1D and sorted as per your example? Commented Dec 6, 2016 at 7:09
  • Yes, they are always 1D arrays. Commented Dec 7, 2016 at 1:40

3 Answers


This should work, with a threshold value to clean up your floats:

def remove_duplicates(u, v, a, d=1e-5):
    s = np.argsort(u)                   # sort order of u (stable for ties)
    ud = abs(u[s][1:] - u[s][:-1]) < d  # adjacent pairs within tolerance d
    vd = v[s][1:] < v[s][:-1]           # pairs where the second v is smaller
    # For each near-equal pair, drop the first element if its v is larger,
    # otherwise drop the second.
    drop = np.union1d(s[:-1][ud & vd], s[1:][ud & ~vd])
    return np.delete(u, drop), np.delete(v, drop), np.delete(a, drop)
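As a quick check, applying the function to the sample arrays from the question reproduces the expected result from the question:

```python
import numpy as np

def remove_duplicates(u, v, a, d=1e-5):
    s = np.argsort(u)                   # sort order of u (stable for ties)
    ud = abs(u[s][1:] - u[s][:-1]) < d  # adjacent pairs within tolerance d
    vd = v[s][1:] < v[s][:-1]           # pairs where the second v is smaller
    # For each near-equal pair, drop the element with the larger v.
    drop = np.union1d(s[:-1][ud & vd], s[1:][ud & ~vd])
    return np.delete(u, drop), np.delete(v, drop), np.delete(a, drop)

u = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
v = np.array([10.0, 21.0, 18.0, 30.0, 40.0])
a = np.array([100.0, 210.0, 220.0, 300.0, 400.0])

u_new, v_new, a_new = remove_duplicates(u, v, a)
# u_new == [1, 2, 3, 4], v_new == [10, 18, 30, 40], a_new == [100, 220, 300, 400]
```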



You can use the zip, sorted and groupby functions:

from itertools import groupby
u1, v1 = zip(*[next(g) for k, g in groupby(sorted(zip(u, v)), key = lambda x: x[0])])
# note: next takes the first element (smallest v value) from each group,
# since sorted() orders each group by v

u1
# (1.0, 2.0, 3.0, 4.0)

v1
# (10.0, 18.0, 30.0, 40.0)
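This drops the third array a; the same idea extends to three-tuples (a sketch, zipping all three arrays so that next(g) still picks the smallest-v tuple of each group):

```python
import numpy as np
from itertools import groupby

u = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
v = np.array([10.0, 21.0, 18.0, 30.0, 40.0])
a = np.array([100.0, 210.0, 220.0, 300.0, 400.0])

# sorted() orders the tuples by u first, then v, so within each u-group the
# first tuple has the smallest v; next(g) takes exactly that one.
u1, v1, a1 = zip(*[next(g) for _, g in
                   groupby(sorted(zip(u, v, a)), key=lambda t: t[0])])
# u1 == (1, 2, 3, 4), v1 == (10, 18, 30, 40), a1 == (100, 220, 300, 400)
```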

4 Comments

very beautiful solution. Thank you a lot.
When the numpy array elements are floats, the method above will have problems, because comparing two floats is not exact without a given tolerance. How can I give a tolerance and make it work? Thank you.
I think you can round the key to the precision you wanted. u1, v1 = zip(*[next(g) for k, g in groupby(sorted(zip(u, v)), key = lambda x: round(x[0], 10))]) for instance.
Rounding the float to some precision is an alternative method for this. Thank you.

Approach #1 : Here's an approach for floating-point numbers that splits the data into groups of tolerable (by the given tolerance value) proximity -

tol = 1e-5 # Set tolerance for floating-point number match
# Split v into groups wherever consecutive u values differ by more than tol
# (assumes u is sorted, as per the question's comments).
A = np.split( v, np.flatnonzero(np.diff(u) > tol)+1)
lens = np.array(list(map(len,A)))
# Position of the max v within each group, converted to global indices
idx = np.array([np.argmax(i) for i in A])
idx[1:] += lens[:-1].cumsum()
# Mask out the max-v element of every group that has duplicates
m = ~np.in1d(np.arange(a.size), idx[lens>1])
u_new, v_new, a_new = u[m], v[m], a[m]

Sample input, output -

In [143]: u=np.array([1.0,2.0,2.00000001,3.0,3.9999998, 4.0, 4.00000001])
     ...: v=np.array([10.0,21.0,18.0,30.0,36.0, 40.0, 38.0])
     ...: a=np.array([100.0,210.0,220.0,300.0,77.0, 400.0, 67.00])
     ...: 

In [144]: u_new
Out[144]: array([ 1.        ,  2.00000001,  3.        ,  3.9999998 ,  4.00000001])

In [145]: v_new
Out[145]: array([ 10.,  18.,  30.,  36.,  38.])

In [146]: a_new
Out[146]: array([ 100.,  220.,  300.,   77.,   67.])

Approach #2 : Here's another approach that avoids splitting and as such should be more efficient -

# Group label for each element: consecutive u values within tol share a label
u_idx = np.append(False, np.diff(u) > tol).cumsum()
# Index of the last element of each group
max_idx = (np.append(np.unique(u_idx, return_index=1)[1], u_idx.size)-1)[1:]
# Sort primarily by group, secondarily by v, so each group's last element
# in sorted order is the one with the largest v
sidx = (v.max()*u_idx + v).argsort()
# Drop the max-v element of every group with more than one member
m = ~np.in1d(np.arange(a.size), sidx[max_idx][np.bincount(u_idx)>1])
u_new, v_new, a_new = u[m], v[m], a[m]
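As a sanity check, here is Approach #2 run end-to-end on the question's original sample arrays with tol = 1e-5:

```python
import numpy as np

tol = 1e-5
u = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
v = np.array([10.0, 21.0, 18.0, 30.0, 40.0])
a = np.array([100.0, 210.0, 220.0, 300.0, 400.0])

# Group label for each element: consecutive u values within tol share a label.
u_idx = np.append(False, np.diff(u) > tol).cumsum()
# Index of the last element of each group.
max_idx = (np.append(np.unique(u_idx, return_index=1)[1], u_idx.size) - 1)[1:]
# Sort by group first, then by v, so each group's last sorted element
# is the one with the largest v.
sidx = (v.max() * u_idx + v).argsort()
# Remove the max-v element of every group with more than one member.
m = ~np.in1d(np.arange(a.size), sidx[max_idx][np.bincount(u_idx) > 1])
u_new, v_new, a_new = u[m], v[m], a[m]
# u_new == [1, 2, 3, 4], v_new == [10, 18, 30, 40], a_new == [100, 220, 300, 400]
```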

1 Comment

Thank you for your two approaches. I believe both of them will work well, though I am not sure I understand them thoroughly.
