2

I would like to check if the array b is a subset of the array a. By subset, I mean that all the elements of b are found in a.

Here is the code I have:

import numpy as np
a = np.array([[1,7,9],[8,3,12],[101,-74,0.5]])
b = np.array([[1,9],[8,12],[101,0.5]])
print(a)
print(b)

Here is the output:

Array a

[[   1.     7.     9. ]
 [   8.     3.    12. ]
 [ 101.   -74.     0.5]]

Array b

[[   1.     9. ]
 [   8.    12. ]
 [ 101.     0.5]]

Is there a way to check if b is a subset of a?

EDIT: Additional Information:

As per the comments below, I should clarify what I need: I want to know whether array b is a subset of array a, that is, to detect the case where even one element of b is missing from a. I do not need to know where the missing element is, only that something is missing. Any additional information about the missing element would be a bonus, but it is not a hard requirement. Apologies for not making this clear earlier.

My reasoning in phrasing the question as a subset is that if one array is a subset of the other array then this would imply to me that all the values of the subset array are present in the larger array.

  • I think you need to elaborate on "I would like to check if all the elements of b are found in a", since we are dealing with 2D arrays here. Think of the various situations that might negate your definition of "subset", and of the other situations that must follow. All elements along the respective rows from a and b? Along the same column only in b? Commented May 16, 2016 at 19:29
  • Sorry, I should have explained this. Check if all elements along respective columns of b are subsets of those in a. This is what I am after. Commented May 16, 2016 at 19:54
  • So the desired output in this case would be a bool array with three values of true, right? One for each row, which indeed have columns which are subsets. Commented May 16, 2016 at 20:55
  • How are you defining subset here? Are you looking for the existence of a pair of boolean masks such that (a[m1,m2] == b).all(), i.e. some subset of the rows and columns? Commented May 17, 2016 at 1:39

4 Answers

5

I think you want numpy.in1d, something like this:

import numpy as np
a = np.array([[1,7,9],[8,3,12],[101,-74,0.5]])
b = np.array([[1,9],[8,12],[101,0.5]])

np.in1d(b.ravel(), a.ravel()).all()
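
The NumPy documentation recommends np.isin over np.in1d for new code. Here is a minimal sketch of the same check with np.isin, plus np.setdiff1d to report which values are missing (the "bonus" asked for in the question). The array c and the helper names is_subset and missing_values are made up for illustration. Note that values are compared by exact equality, so floats that differ only by rounding error count as different.

```python
import numpy as np

a = np.array([[1, 7, 9], [8, 3, 12], [101, -74, 0.5]])
b = np.array([[1, 9], [8, 12], [101, 0.5]])
# Hypothetical array for illustration: 42.0 does not occur in a
c = np.array([[1, 9], [8, 12], [101, 42.0]])

def is_subset(small, big):
    """True if every value in `small` occurs somewhere in `big`."""
    # np.isin tests each element of `small` against all values of `big`
    return bool(np.isin(small, big).all())

def missing_values(small, big):
    """Sorted unique values of `small` that do not occur in `big`."""
    return np.setdiff1d(small, big)

print(is_subset(b, a))       # True
print(is_subset(c, a))       # False
print(missing_values(c, a))  # the single missing value, 42.0
```

Like the one-liner above, this ignores rows and columns entirely and treats both arrays as flat collections of values.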

1 Comment

Thanks a lot. Although I accepted the earlier answer, this works as well. Simple one-liner.
3

If you want to compare columns, one way is to group them first:

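import numpy as np

a = np.array([[1,7,9],[8,3,12],[101,-74,0.5]])
b = np.array([[1,9],[8,12],[101,0.5]])
c = np.array([[1,9],[8,12],[101,-74.]])

def bycols(arr):
    # Transpose and copy so that each column of arr is contiguous in memory
    tr = arr.T.copy()
    # One void element spans a whole row of tr, i.e. one column of arr
    col_dtype = np.dtype((np.void, tr.strides[0]))
    return tr.view(col_dtype).squeeze()

A, B, C = [bycols(x) for x in (a, b, c)]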

A, B and C are now 1-D arrays in which each element holds the raw bytes of one column:

In [5]: [x.shape for x in (A,B,C)]
Out[5]: [(3,), (2,), (2,)]

You can now test membership with np.in1d:

In [6]: np.in1d(C,A)
Out[6]: array([ True, False], dtype=bool)

In [7]: np.in1d(B,A)
Out[7]: array([ True,  True], dtype=bool)

But:

In [8]: np.in1d(c,a)
Out[8]: array([ True,  True,  True,  True,  True,  True], dtype=bool)

since np.in1d operates on flattened arrays.
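
The same column-wise test can also be sketched without the byte view, using broadcasting to compare every column of one array with every column of the other. The helper name columns_in is made up for illustration, and the sketch assumes both arrays have the same number of rows and are small enough that an intermediate array of shape (columns of small, columns of big, rows) fits in memory.

```python
import numpy as np

a = np.array([[1, 7, 9], [8, 3, 12], [101, -74, 0.5]])
b = np.array([[1, 9], [8, 12], [101, 0.5]])
c = np.array([[1, 9], [8, 12], [101, -74.0]])

def columns_in(small, big):
    """For each column of `small`, report whether an identical column exists in `big`."""
    # Shape (n_small_cols, n_big_cols, n_rows): every pair of columns compared element-wise
    equal = small.T[:, None, :] == big.T[None, :, :]
    # A column matches if all of its elements equal those of at least one column of `big`
    return equal.all(axis=2).any(axis=1)

print(columns_in(b, a))  # both columns of b are columns of a
print(columns_in(c, a))  # the second column of c is not a column of a
```

This agrees with the np.in1d results on A, B and C above: every value of c occurs somewhere in a, but its second column as a whole does not.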


2

This should work:

set(np.unique(b)).issubset(set(np.unique(a)))

EDIT: The code above returns True or False rather than a column vector of booleans. From @Eelco Hoogendoorn's comment to your question, I understand that you are actually interested in checking whether a row of b is a subset of the corresponding row of a, right? Assuming that this is the correct problem description, the following one-liner should work:

np.array([[set(bi).issubset(set(ai))] for ai, bi in zip(map(tuple, a), map(tuple, b))])

The code above is simple, readable, and does not require third-party dependencies. It is admittedly a quick and dirty solution, since, as @Bi Rico correctly pointed out, such an approach can be pretty inefficient. If you need to handle large arrays, you should stick to a vectorized algorithm.
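
For reference, one possible vectorized version of the row-wise check uses broadcasting instead of Python sets. This is only a sketch: the name rows_subset and the array d are made up for illustration, it assumes a and b have the same number of rows, and it returns a flat boolean array of shape (n_rows,) rather than a column vector.

```python
import numpy as np

a = np.array([[1, 7, 9], [8, 3, 12], [101, -74, 0.5]])
b = np.array([[1, 9], [8, 12], [101, 0.5]])
# Hypothetical array for illustration: 7 is not in the second row of a
d = np.array([[1, 9], [8, 7], [101, 0.5]])

def rows_subset(small, big):
    """For each row i, report whether every value of small[i] occurs in big[i]."""
    # Shape (n_rows, n_small_cols, n_big_cols): each value in a row of `small`
    # is compared with each value in the same row of `big`
    found = (small[:, :, None] == big[:, None, :]).any(axis=2)
    # A row passes only if all of its values were found
    return found.all(axis=1)

print(rows_subset(b, a))  # every row of b is a subset of the matching row of a
print(rows_subset(d, a))  # the second row of d fails
```

The intermediate comparison array grows with rows × columns of b × columns of a, so for very wide arrays a sort-based approach may be preferable.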

4 Comments

Thanks. This works and it answers my question.
It's worth noting that for larger arrays, switching between numpy arrays and built in types like sets and lists can be pretty expensive in processing time and memory usage.
If I read your question correctly, this does not answer it; it regards all elements in the array at once, and does not act per-column.
Thank you for the feedback. Either option will work for me. I am just looking to check for the presence of False - this would indicate that one value (any value) is different. If this is the case, then I know that one array has a problem with it. If it is possible to have a comparison between rows, as you have done in your edit, then that is also useful but not necessary. Your initial answer works just fine because it identifies False or True and this is exactly what I was looking for. Thanks for the added solution.
1

If I read your question correctly (test for each corresponding row in a and b, if the row in b is a subset of the row in a), this should do it efficiently and correctly:

import numpy as np
import numpy_indexed as npi
rowsa = np.indices(a.shape)[0]
rowsb = np.indices(b.shape)[0]
# test for each value-rowidx pair in b if it is contained in a
c = npi.contains((a.flatten(), rowsa.flatten()), (b.flatten(), rowsb.flatten()))
# check that all elements on a row are contained
row_is_subset = c.reshape(b.shape).all(axis=1)

You need to install the numpy_indexed package (disclaimer: I am its author).

