
If I have the following numpy array:

import numpy as np

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

How would I approach removing near-identical rows? I do not mind whether I end up with the row [399, 715], [399, 716], [400, 715], or [400, 716]. As an end result I would, for instance, like to get:

out = remove_near_identical(arr)
print(out)

[[285 849]
 [399 715]]
  • I would use something like: import numpy_indexed as npi; npi.group_by(a[:, 0]).split(a[:, 1]). However, I would change the expression a[:, 0] so that each row is rounded to the nearest 10, for example, and then pick the first element of each group (a sketch of this idea follows the comments below). Commented May 13, 2020 at 20:45
  • Is there a criterion to determine near-identical? Commented May 13, 2020 at 20:52
  • @Divakar I suppose they can be +/- 3 apart in either column. Would probably be better if you could specify a criterion. Commented May 13, 2020 at 20:55
  • I'd definitely start by clearly defining your criteria. No point in doing anything until you know what you're shooting for. Commented May 13, 2020 at 21:14
  • @MadPhysicist Well, couldn't you specify that in the function itself? e.g. row distance can be at most x Commented May 13, 2020 at 21:21
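
A minimal sketch of the rounding-then-grouping idea from the first comment, using plain NumPy (np.unique on the rounded rows) rather than numpy_indexed; rounding each row to the nearest 10 is an assumption taken from that comment:

import numpy as np

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

# Round every row to the nearest 10 so near-identical rows share a key,
# then keep the first row of each group of equal keys.
keys = np.round(arr / 10).astype(int)
_, first_idx = np.unique(keys, axis=0, return_index=True)
print(arr[np.sort(first_idx)])
# [[285 849]
#  [399 715]]

Note that rows straddling a rounding boundary can land in different buckets, which is why the distance-based answers below may be more robust.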

3 Answers


The following assumes a 2D dataset and retains the original row order. A row is removed when its mean absolute difference to the last kept row, sum(|row_i - row_kept|) / n_columns, is below the threshold.

import numpy as np


def remove_near_identical_rows(arr, threshold):
    row, column = arr.shape
    # Sort rows by the first column so near-identical rows become adjacent
    order = arr[:, 0].argsort()
    arr_sorted = arr[order]

    keep = np.zeros(row, dtype=bool)
    cur_row = arr_sorted[0]
    keep[0] = True
    for i in range(1, row):
        # Keep a row only if its mean absolute difference to the last
        # kept row exceeds the threshold
        if np.sum(np.abs(arr_sorted[i] - cur_row)) / column > threshold:
            keep[i] = True
            cur_row = arr_sorted[i]

    # Map kept rows back to their original positions to retain input order
    return arr[np.sort(order[keep])]

arr = np.array([[399, 715],
                [285, 849],
                [399, 716],
                [400, 715],
                [400, 716]])

arr = remove_near_identical_rows(arr, 10)
print(arr)

Outputs

[[399 715]
 [285 849]]
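
As a quick check of how the threshold behaves (using the function above; 200 is just an illustrative value), a much larger threshold leaves only the first kept row:

arr = remove_near_identical_rows(np.array([[399, 715],
                                           [285, 849],
                                           [399, 716],
                                           [400, 715],
                                           [400, 716]]), 200)
print(arr)
# [[285 849]]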



Approach #1

Well, if you are not sure how to decide the near-identical criterion, a well-known one would be based on the distances among the rows. With that in mind, some sort of distance-based clustering could be a good fit here. So, here's one with sklearn.cluster.AgglomerativeClustering -

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_based_on_distance(a, dist_thresh=10):
    # Merge rows whose linkage distance is below dist_thresh into one cluster,
    # then keep the first row (in input order) of each cluster
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=dist_thresh).fit(a)
    return a[np.sort(np.unique(clustering.labels_, return_index=True)[1])]

Sample runs -

In [16]: a
Out[16]: 
array([[285, 849],
       [450, 717],
       [399, 715],
       [399, 716],
       [400, 715],
       [450, 716],
       [150, 716]])

In [17]: cluster_based_on_distance(a, dist_thresh=10)
Out[17]: 
array([[285, 849],
       [450, 717],
       [399, 715],
       [150, 716]])

In [18]: cluster_based_on_distance(a, dist_thresh=100)
Out[18]: 
array([[285, 849],
       [450, 717],
       [150, 716]])

In [19]: cluster_based_on_distance(a, dist_thresh=1000)
Out[19]: array([[285, 849]])

Approach #2

Another one, based on euclidean-distance thresholding with a KDTree -

from scipy.spatial import cKDTree

def cluster_based_on_eucl_distance(a, dist_thresh=10):
    # Distance and index of each row's nearest neighbour
    # (k=2 because the closest match of every row is the row itself)
    d, idx = cKDTree(a).query(a, k=2)
    min_idx = idx.min(1)
    # Keep rows whose nearest other row is farther than the threshold...
    mask = d[:, 1] > dist_thresh
    # ...and, for each group of close rows, re-include the lowest-index member
    mask[min_idx[~mask]] = True
    return a[mask]
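
A quick sanity check, on a three-row subset of the question's array (a subset chosen so that nearest-neighbour ties don't make the kept representative ambiguous):

a = np.array([[285, 849],
              [399, 715],
              [400, 716]])

print(cluster_based_on_eucl_distance(a, dist_thresh=10))
# [[285 849]
#  [399 715]]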

Approach #3

Another one, based on thresholding the absolute differences in each of the two columns separately -

def cluster_based_on_either_xydist(a, dist_thresh=10):
    # Pairwise "closeness" tests for each column separately
    c0 = np.abs(a[:, 0, None] - a[:, 0]) < dist_thresh
    c1 = np.abs(a[:, 1, None] - a[:, 1]) < dist_thresh
    c01 = c0 & c1
    # Drop any row that is close (in both columns) to an earlier row
    return a[~np.triu(c01, 1).any(0)]
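
For reference, running this third approach on the array from the question reproduces the two-row result the question asks for:

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

print(cluster_based_on_either_xydist(arr, dist_thresh=10))
# [[285 849]
#  [399 715]]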



A method based only on distances:

import numpy as np
from scipy.spatial.distance import cdist

arr = np.array([[285, 849],
                [399, 715],
                [399, 716],
                [400, 715],
                [400, 716]])

# get distances between every pair of points
dists = cdist(arr, arr)
dists[np.isclose(dists, 0)] = np.inf  # set 0 (self) distances to infinity, i.e. ignore them

# get indices of point pairs closer than some threshold value (too close)
i, j = np.where(dists <= 1)
# get the unique indices from either i or j (they are symmetric)
# and delete all but one of these points from the original array
np.delete(arr, np.unique(i)[1:], axis=0)
>>> array([[285, 849],
           [399, 715]])
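
If you want this packaged as a reusable helper like the remove_near_identical from the question, a minimal wrapper of the same logic might look like this (the function name and dist_thresh parameter are mine; it reuses the imports and arr defined above):

def remove_near_identical(arr, dist_thresh=1):
    # pairwise distances, with self-distances masked out
    dists = cdist(arr, arr)
    dists[np.isclose(dists, 0)] = np.inf
    # rows involved in at least one too-close pair
    i, _ = np.where(dists <= dist_thresh)
    # keep the first of them, drop the rest
    return np.delete(arr, np.unique(i)[1:], axis=0)

print(remove_near_identical(arr))
# [[285 849]
#  [399 715]]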

