
I have a numpy array as follows:

array([[ 6,  5],
       [ 6,  9],
       [ 7,  5],
       [ 7,  9],
       [ 8, 10],
       [ 9, 10],
       [ 9, 11],
       [10, 10]])

I want to pick elements such that the y coordinates are unique. If two y coordinates are the same, I want to pick the element with the lesser x coordinate.

Expected output

array([[ 6,  5],
       [ 6,  9],
       [ 8, 10],
       [ 9, 11]])

Explanation

pick [6,5] over [7,5]

pick [8,10] over [9,10] and [10,10]

pick [9, 11]

Thanks

  • Please post what you have tried already. Commented Jul 8, 2018 at 22:46
  • I did a naive loop. Played around with unique but didn't get anywhere. Commented Jul 8, 2018 at 22:49
  • Is the array always sorted on x-coordinate as given in the example? Commented Jul 8, 2018 at 22:56
  • @gaganso Not necessarily. I can sort them for convenience Commented Jul 8, 2018 at 22:58
  • Just sort before you use unique Commented Jul 8, 2018 at 23:01

3 Answers


First, sort by the first column:

a = a[a[:, 0].argsort()]

Then select the first occurrence of each unique y value using np.unique with the return_index flag:

a[np.unique(a[:, 1], return_index=True)[1]]

array([[ 6,  5],
       [ 6,  9],
       [ 8, 10],
       [ 9, 11]])
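
For reference, argsort_unique in the timings below is not defined in the answer; presumably it is just the two steps above wrapped into a function. A minimal reconstruction:

```python
import numpy as np

def argsort_unique(a):
    # Sort rows by the x column so that, within each group of equal y,
    # the row with the smallest x comes first.
    a = a[a[:, 0].argsort()]
    # return_index gives the index of the first occurrence of each unique y,
    # which after the sort is exactly the row with the minimal x.
    return a[np.unique(a[:, 1], return_index=True)[1]]

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
              [8, 10], [9, 10], [9, 11], [10, 10]])
print(argsort_unique(a))
```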

Some timings:

a = np.random.randint(1, 10, 10000).reshape(-1, 2)

In [45]: %timeit rows_by_unique_y(a)
3.83 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %timeit argsort_unique(a)
370 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Yes, my approach uses an initial sort, but vectorized operations in numpy beat iteration in Python.


5 Comments

Sorting is asymptotically suboptimal for this problem. O(n log n) vs O(n)
Why don't you time both of our approaches? Looping when using numpy is bad.
+1, cheers. Yours indeed has a much lower constant! I often underestimate just how slow Python interpreters are.
@AlexReinking yea it's a counter-intuitive thing when working with numpy. In answers like this, I actually sort something twice when dropping duplicates and it still easily beats any vanilla python approach.
Playing with a few input array sizes, I can actually see the linear vs n-log-n scaling, but the crossing point is around 25 million rows

If you are open to using another library, I would suggest numpy_indexed for an efficient and compact solution:

import numpy as np
import numpy_indexed as npi

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9], [8, 10], [9, 10], [9, 11], [10, 10]])

column_to_groupby = 1
groups, reduced = npi.group_by(a[:,column_to_groupby]).min(a)
print(reduced)

It gives the following output:

[[ 6  5]
 [ 6  9]
 [ 8 10]
 [ 9 11]]

Here is the timeit result

In [5]: a = np.random.randint(1, 10, 10000).reshape(-1, 2)

In [6]: %timeit npi.group_by(a[:,1]).min(a)
354 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
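
For comparison, the same group-by-minimum can be sketched in plain numpy with np.lexsort, with no extra dependency (my own illustration, not part of numpy_indexed):

```python
import numpy as np

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
              [8, 10], [9, 10], [9, 11], [10, 10]])

# Sort by y first, breaking ties by x, so the first row of each
# y group is the one with the minimal x. lexsort's last key is primary.
order = np.lexsort((a[:, 0], a[:, 1]))
s = a[order]

# Mark the first row of each y group.
first = np.ones(len(s), dtype=bool)
first[1:] = s[1:, 1] != s[:-1, 1]
print(s[first])
```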



One approach: loop through the array, keep track of the best (smallest) x seen for each y, then reconstruct the array at the end:

import numpy as np
from collections import defaultdict

def rows_by_unique_y(arr):
    # Map each y to the smallest x seen so far.
    best_for_y = defaultdict(lambda: float('inf'))
    for x, y in arr:
        best_for_y[y] = min(x, best_for_y[y])
    return np.array([[x, y] for y, x in best_for_y.items()])

arr = np.array([[6,  5], [6,  9], [7,  5], [7,  9], [8, 10], [9, 10], [9, 11], [10, 10]])
print(rows_by_unique_y(arr))

No need to sort, just keep track of the minimums. This outputs:

[[ 6  5]
 [ 6  9]
 [ 8 10]
 [ 9 11]]

While this answer is asymptotically faster, user3483203's answer is much better in practice. This is because it calls out to optimized C code rather than staying inside of Python's surprisingly slow interpreter. However, if your arrays are huge (several gigabytes) then the O(n log n) behavior will start to lose to this.

At the same time, if your arrays are that large, you should probably be using a MapReduce framework like Spark instead. The algorithm I gave above is easily parallelized.
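
A sequential sketch of that map/reduce structure, assuming a simple chunk-then-merge scheme (the chunking and helper names are my own illustration, not an actual Spark job):

```python
from collections import defaultdict
import numpy as np

def best_for_y_chunk(chunk):
    # Map step: per-chunk minimum x for each y.
    best = {}
    for x, y in chunk:
        best[y] = min(x, best.get(y, x))
    return best

def merge(d1, d2):
    # Reduce step: combine two partial results by keeping the smaller x.
    out = dict(d1)
    for y, x in d2.items():
        out[y] = min(x, out.get(y, x))
    return out

arr = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
                [8, 10], [9, 10], [9, 11], [10, 10]])

# In a real framework each chunk would be a partition processed in parallel.
parts = [best_for_y_chunk(c) for c in np.array_split(arr, 2)]
result = parts[0]
for p in parts[1:]:
    result = merge(result, p)
print(sorted((x, y) for y, x in result.items()))
```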


If you don't need the minimum x values, then the following one-liner using np.unique works:

arr[np.unique(arr[:,1], return_index=True)[1]]

but this returns

array([[ 6,  5],
       [ 6,  9],
       [10, 10],
       [ 9, 11]])

if you switch the 8 and the 10.

4 Comments

Thank you Alex. I was trying to avoid looping. I tried something like np.unique(d[:,1]) and played around with it.
It's going to be difficult to square the behavior of np.unique with your need to keep the minimum x value. If you find that you need to sort, I would thoroughly test that solution against this one, since sorting will make the algorithm O(n log n) while my code runs in amortized (for the hash table) O(n).
@user009122 - I added how you would use np.unique if you didn't have the x-value constraint
You are right about the complexity. For my use case I was looking for concise code, and I am not too bothered about complexity.
