
If I have the following numpy array:

import numpy as np

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

How would I approach removing near-identical rows? I do not mind whether I end up with the row [399, 715], [399, 716], [400, 715], or [400, 716]. As an end result I would, for instance, like to get:

out = remove_near_identical(arr)
print(out)

[[285 849]
 [399 715]]
  • I would use something like: import numpy_indexed as npi; npi.group_by(a[:, 0]).split(a[:, 1]). However, I would change the expression a[:, 0] so that each row is rounded to the nearest 10, for example, and then pick the first element of each group (a sketch of this idea follows the comments below). Commented May 13, 2020 at 20:45
  • Is there a criterion to determine near-identical? Commented May 13, 2020 at 20:52
  • @Divakar I suppose they can be +/- 3 apart in either column. Would probably be better if you could specify a criterion. Commented May 13, 2020 at 20:55
  • I'd definitely start by clearly defining your criteria. No point in doing anything until you know what you're shooting for. Commented May 13, 2020 at 21:14
  • @MadPhysicist Well, couldn't you specify that in the function itself? e.g. row distance can be at most x Commented May 13, 2020 at 21:21
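
A minimal sketch of the rounding-then-grouping idea from the first comment, using plain NumPy (np.unique on the rounded rows) rather than numpy_indexed; rounding each row to the nearest 10 is an assumption taken from that comment:

import numpy as np

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

# Round every row to the nearest 10 so near-identical rows share a key,
# then keep the first row of each group of equal keys.
keys = np.round(arr / 10).astype(int)
_, first_idx = np.unique(keys, axis=0, return_index=True)
print(arr[np.sort(first_idx)])
# [[285 849]
#  [399 715]]

Note that rows straddling a rounding boundary can land in different buckets, which is why the distance-based answers below may be more robust.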

3 Answers


The following assumes a 2D dataset and retains the original row order. A row is removed when its mean absolute difference to the last kept row, sum(|row_i - row_kept|) / n_columns, is below the threshold.

import numpy as np


def remove_near_identical_rows(arr, threshold):
    row, column = arr.shape
    # Sort rows by the first column so near-identical rows become adjacent
    order = arr[:, 0].argsort()
    arr_sorted = arr[order]

    keep = np.zeros(row, dtype=bool)
    cur_row = arr_sorted[0]
    keep[0] = True
    for i in range(1, row):
        # Keep a row only if its mean absolute difference to the last
        # kept row exceeds the threshold
        if np.sum(np.abs(arr_sorted[i] - cur_row)) / column > threshold:
            keep[i] = True
            cur_row = arr_sorted[i]

    # Map kept rows back to their original positions to retain input order
    return arr[np.sort(order[keep])]

arr = np.array([[399, 715],
                [285, 849],
                [399, 716],
                [400, 715],
                [400, 716]])

arr = remove_near_identical_rows(arr, 10)
print(arr)

Outputs

[[399 715]
 [285 849]]
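
As a quick check of how the threshold behaves (using the function above; 200 is just an illustrative value), a much larger threshold leaves only the first kept row:

arr = remove_near_identical_rows(np.array([[399, 715],
                                           [285, 849],
                                           [399, 716],
                                           [400, 715],
                                           [400, 716]]), 200)
print(arr)
# [[285 849]]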



Approach #1

Well, if you are not sure how to decide the near-identical criterion, a well-known one would be based on the distances among the rows. With that in mind, some sort of distance-based clustering could be a good fit here. So, here's one with sklearn.cluster.AgglomerativeClustering -

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_based_on_distance(a, dist_thresh=10):
    # Merge rows whose linkage distance is below dist_thresh into one cluster,
    # then keep the first row (in input order) of each cluster
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=dist_thresh).fit(a)
    return a[np.sort(np.unique(clustering.labels_, return_index=True)[1])]

Sample runs -

In [16]: a
Out[16]: 
array([[285, 849],
       [450, 717],
       [399, 715],
       [399, 716],
       [400, 715],
       [450, 716],
       [150, 716]])

In [17]: cluster_based_on_distance(a, dist_thresh=10)
Out[17]: 
array([[285, 849],
       [450, 717],
       [399, 715],
       [150, 716]])

In [18]: cluster_based_on_distance(a, dist_thresh=100)
Out[18]: 
array([[285, 849],
       [450, 717],
       [150, 716]])

In [19]: cluster_based_on_distance(a, dist_thresh=1000)
Out[19]: array([[285, 849]])

Approach #2

Another one, based on euclidean-distance thresholding with a KDTree -

from scipy.spatial import cKDTree

def cluster_based_on_eucl_distance(a, dist_thresh=10):
    # Distance and index of each row's nearest neighbour
    # (k=2 because the closest match of every row is the row itself)
    d, idx = cKDTree(a).query(a, k=2)
    min_idx = idx.min(1)
    # Keep rows whose nearest other row is farther than the threshold...
    mask = d[:, 1] > dist_thresh
    # ...and, for each group of close rows, re-include the lowest-index member
    mask[min_idx[~mask]] = True
    return a[mask]
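
A quick sanity check, on a three-row subset of the question's array (a subset chosen so that nearest-neighbour ties don't make the kept representative ambiguous):

a = np.array([[285, 849],
              [399, 715],
              [400, 716]])

print(cluster_based_on_eucl_distance(a, dist_thresh=10))
# [[285 849]
#  [399 715]]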

Approach #3

Another one, based on thresholding the absolute differences in each of the two columns separately -

def cluster_based_on_either_xydist(a, dist_thresh=10):
    # Pairwise "closeness" tests for each column separately
    c0 = np.abs(a[:, 0, None] - a[:, 0]) < dist_thresh
    c1 = np.abs(a[:, 1, None] - a[:, 1]) < dist_thresh
    c01 = c0 & c1
    # Drop any row that is close (in both columns) to an earlier row
    return a[~np.triu(c01, 1).any(0)]
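
For reference, running this third approach on the array from the question reproduces the two-row result the question asks for:

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

print(cluster_based_on_either_xydist(arr, dist_thresh=10))
# [[285 849]
#  [399 715]]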



A method based only on distances:

import numpy as np
from scipy.spatial.distance import cdist

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

# get distances between every pair of points
dists = cdist(arr, arr)
dists[np.isclose(dists, 0)] = np.inf  # set 0 (self) distances to infinity, i.e. ignore them

# get indices of point pairs closer than some threshold value (too close)
i, j = np.where(dists <= 1)
# get the unique indices from either i or j (they are symmetric)
# and delete all but one of these points from the original array
np.delete(arr, np.unique(i)[1:], axis=0)
>>> array([[285, 849],
           [399, 715]])
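
If you want this packaged as a reusable helper like the remove_near_identical from the question, a minimal wrapper of the same logic might look like this (the function name and dist_thresh parameter are mine; it reuses the imports and arr defined above):

def remove_near_identical(arr, dist_thresh=1):
    # pairwise distances, with self-distances masked out
    dists = cdist(arr, arr)
    dists[np.isclose(dists, 0)] = np.inf
    # rows involved in at least one too-close pair
    i, _ = np.where(dists <= dist_thresh)
    # keep the first of them, drop the rest
    return np.delete(arr, np.unique(i)[1:], axis=0)

print(remove_near_identical(arr))
# [[285 849]
#  [399 715]]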

