
I have found other methods, such as this, to remove duplicate elements from an array. My requirement is slightly different. If I start with:

array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])

I would like to end up with:

array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

That's what I would ultimately like to end up with, but there is an extra requirement. I would also like to store either an array of indices to discard, or to keep (a la numpy.take).

I am using NumPy 1.8.1.
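For concreteness, the two index arrays described above would relate to the array like this (a minimal sketch; the particular index values are just what this example produces):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

keep = np.array([1, 3, 4])     # indices of rows that occur exactly once
discard = np.array([0, 2])     # indices of the duplicated rows

# Either selection yields the desired result
result_take = np.take(a, keep, axis=0)
result_delete = np.delete(a, discard, axis=0)
```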

  • You can count how many time each row appears using methods suggested, for example, here and here. I think that's what your problem here reduces to. Commented Dec 6, 2015 at 21:51
  • @ajcr I can't use return_counts so #1 is out for me. Unfortunately #2 seems to require sorted array, and I need to preserve the order. Commented Dec 6, 2015 at 22:35
  • @codedog Were either of the answers helpful? If not, could you let us know what else you're looking for? Commented Dec 8, 2015 at 6:10

4 Answers


We want to find rows which are not duplicated in your array, while preserving the order.

I use this solution to combine each row of a into a single element, so that we can find the unique rows using np.unique(b, return_index=True, return_inverse=True). I then modified this function to compute the counts of the unique rows from the index and inverse. From there, I can select all unique rows which have counts == 1.

a = np.array([[1, 2, 3],
       [2, 3, 4],
       [1, 2, 3],
       [3, 2, 1],
       [3, 4, 5]])

#use a flexible data type, np.void, to combine the columns of `a`
#size of np.void is the number of bytes for an element in `a` multiplied by number of columns
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index = True, return_inverse = True)

def return_counts(index, inv):
    count = np.zeros(len(index), dtype=int)  # the np.int alias was later removed in NumPy 1.24
    np.add.at(count, inv, 1)
    return count

counts = return_counts(index, inv)

#for the first occurrence of each duplicated row instead, use counts[i] > 1
index_keep = [j for i, j in enumerate(index) if counts[i] == 1]

>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

#if you don't need the indices and just want the array returned while preserving the order
a_unique = np.vstack([a[idx] for i, idx in enumerate(index) if counts[i] == 1])
>>> a_unique
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

For NumPy >= 1.9, return_counts gives the counts directly:

b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, counts = np.unique(b, return_index = True, return_counts = True)

index_keep = [j for i, j in enumerate(index) if counts[i] == 1]
>>> a[index_keep]
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])
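As a side note, the counting helper above can also be written with np.bincount, which works on NumPy 1.8 as well; a minimal sketch (not part of the original answer), with a final sort so the kept rows come out in their original order:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

# Same void-view trick as above: treat each row as a single element
b = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, index, inv = np.unique(b, return_index=True, return_inverse=True)

counts = np.bincount(inv.ravel())          # occurrences of each unique row
index_keep = np.sort(index[counts == 1])   # original indices, original order
a_unique = a[index_keep]
```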

3 Comments

except the request was to exclude [1,2,3] since it occurs more than once
@Dan Patterson Thanks for pointing it out, I have edited my solution.
Yesterday I found out we have implemented this in a C extension. I have not tested this solution explicitly but it looks very similar to what has been implemented here. That's why I accepted it as a solution. Thanks.

You can proceed as follows:

# Assuming your array is a
uniq, uniq_idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)

# to return the array you want
new_arr = uniq[counts == 1]

# The indices of non-unique rows
a_idx = np.arange(a.shape[0]) # the indices of array a
nuniq_idx = a_idx[np.in1d(a_idx, uniq_idx[counts==1], invert=True)] 

You get:

#new_arr
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

# nuniq_idx
array([0, 2])
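One caveat worth noting: the axis keyword of np.unique requires NumPy >= 1.13, and np.unique returns its rows in sorted order, so uniq[counts == 1] only happens to match the original order for this example. A sketch that guarantees the original row order by indexing back into a:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [1, 2, 3],
              [3, 2, 1],
              [3, 4, 5]])

uniq, uniq_idx, counts = np.unique(a, axis=0,
                                   return_index=True, return_counts=True)

# First-occurrence indices of the rows that appear exactly once,
# sorted so the rows come back in their original order.
keep_idx = np.sort(uniq_idx[counts == 1])
new_arr = a[keep_idx]
```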



If you want to delete every instance of any row that occurs more than once, you can iterate through the array, collect the indices of rows that also appear elsewhere, and finally delete them:

import numpy

# The array to check:
array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])

# List that contains the indices of duplicates (which should be deleted)
deleteIndices = []

for i in range(len(array)):            # Loop through entire array
    indices = list(range(len(array)))  # All indices in array
    del indices[i]                     # All indices except the i'th row currently being checked

    for j in indices:  # Loop through every other row in array
        if (array[i] == array[j]).all():  # Check if row i equals row j
            deleteIndices.append(j)       # If so, mark j for deletion

# Sort deleteIndices in ascending order:
deleteIndices.sort()

# Delete duplicates
array = numpy.delete(array,deleteIndices,axis=0)

This outputs:

>>> array
array([[2, 3, 4],
       [3, 2, 1],
       [3, 4, 5]])

>>> deleteIndices
[0, 2]

That way you both delete the duplicates and get a list of indices to discard.
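The nested loop above is O(n²); the same result can be had in a single pass by counting rows with a plain collections.Counter. A minimal sketch (not from the original answer) that also works on NumPy 1.8:

```python
from collections import Counter

import numpy

array = numpy.array([[1, 2, 3],
                     [2, 3, 4],
                     [1, 2, 3],
                     [3, 2, 1],
                     [3, 4, 5]])

# Count each row as a hashable tuple, then split the indices by count
row_counts = Counter(map(tuple, array))
keepIndices = [i for i, row in enumerate(array) if row_counts[tuple(row)] == 1]
deleteIndices = [i for i, row in enumerate(array) if row_counts[tuple(row)] > 1]

result = array[keepIndices]
```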



The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a vectorized manner:

import numpy as np
import numpy_indexed as npi

# arr is the example array from the question
index = npi.as_index(arr)
keep = index.count == 1
discard = np.invert(keep)
print(index.unique[keep])

