29

I tried to find the entries of an array that contain a substring, using np.where with an in condition:

import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)

This only returns an empty array.
Why is that?
And is there a good alternative?

5 Answers

34

We can use np.core.defchararray.find to get the position of the string foo in each element of bar; it returns -1 where foo is not found. Comparing that output against -1 therefore tells us which elements contain foo, and np.flatnonzero turns the resulting boolean mask into the indices of the matches:

np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)

Sample run -

In [91]: bar
Out[91]: 
array(['aaa', 'aab', 'aca'], 
      dtype='|S3')

In [92]: foo
Out[92]: 'aa'

In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])

In [94]: bar[2] = 'jaa'

In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
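
For reuse, the same idea can be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the function name substring_indices is hypothetical, and it uses np.char.find, the shorter alias for np.core.defchararray.find.

```python
import numpy as np


def substring_indices(arr, sub):
    """Return the indices of the elements of arr that contain sub.

    Hypothetical helper name, used here for illustration only.
    """
    # np.char.find gives the position of sub in each element, or -1 if absent
    positions = np.char.find(arr, sub)
    return np.flatnonzero(positions != -1)


bar = np.array(["aaa", "aab", "aca"])
print(substring_indices(bar, "aa"))  # [0 1]
```

Indexing with the result, as in bar[substring_indices(bar, "aa")], returns the matching strings themselves.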

10 Comments

This works perfectly. Thank you very much! But out of curiosity, do you know why the in condition in np.where doesn't work?
@SiOx AFAIK bar being a NumPy array doesn't work with in. That in is meant for Python lists, etc., if that makes sense?
in does work with an array, that is ndarray has a __contains__ method. But behavior is similar to that of list.
np.char.find is the shorthand for this function.
This only works when there are no spaces in the elements. If ' aaa' and ' aab' (with a leading space) are used in the case above, it does not work.
6

Look at some examples of using in:

In [19]: bar = np.array(["aaa", "aab", "aca"])

In [20]: 'aa' in bar
Out[20]: False

In [21]: 'aaa' in bar
Out[21]: True

In [22]: 'aab' in bar
Out[22]: True

In [23]: 'aab' in list(bar)
Out[23]: True

It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but it only tests whether some element equals the value, roughly (bar == value).any().

But in any case, note that in with a list does not check for substrings. A string's __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to its component strings.
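
To make the difference concrete, here is a small sketch (not from the original answer) that contrasts the whole-element test done by in on an array with a substring test applied to each string separately. The variable names are illustrative only.

```python
import numpy as np

bar = np.array(["aaa", "aab", "aca"])

# `in` on an ndarray compares whole elements, like (bar == value).any()
whole_element = "aa" in bar
same_as_any = bool((bar == "aa").any())

# A substring test has to be applied to each string individually
per_element = ["aa" in s for s in bar]

print(whole_element, same_as_any)  # False False
print(per_element)                 # [True, True, False]
```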

As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.

In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0,  0, -1])

Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.

For operations like this, I think np.char is about as fast as:

In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)

In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
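
The speed comparison is easy to check locally. The sketch below is an assumption-laden illustration, not a benchmark from the original answer: the enlarged test array and the repeat count are made up, and the timings will vary with the NumPy version and the machine, so no expected numbers are shown.

```python
import timeit

import numpy as np

# Larger, made-up test array so the timings are measurable
bar = np.array(["aaa", "aab", "aca"] * 1000)

find_aa = np.frompyfunc(lambda s: s.find("aa"), 1, 1)

# Both approaches should report the same positions
assert np.array_equal(np.char.find(bar, "aa"), find_aa(bar).astype(int))

t_char = timeit.timeit(lambda: np.char.find(bar, "aa"), number=20)
t_pyfunc = timeit.timeit(lambda: find_aa(bar), number=20)

print(f"np.char.find:  {t_char:.4f} s")
print(f"np.frompyfunc: {t_pyfunc:.4f} s")
```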

Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
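
A quick way to see this is to try in on a 2-D array. The array below is made up for illustration:

```python
import numpy as np

# Made-up 2-D array of strings
grid = np.array([["aaa", "aab"], ["aca", "abc"]])

print("aca" in grid)  # True: some element equals 'aca', wherever it sits
print("aa" in grid)   # False: no element equals 'aa' exactly
```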


5

If using pandas is acceptable, the str.contains method can be used.

import numpy as np
entries = np.array(["aaa", "aab", "aca"])

import pandas as pd
pd.Series(entries).str.contains('aa') # <----

Results in:

0     True
1     True
2    False
dtype: bool

The method also accepts regular expressions for more complex patterns:

pd.Series(entries).str.contains(r'a.a')

Results in:

0     True
1    False
2     True
dtype: bool
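
Because the pattern is treated as a regular expression by default, characters such as . have a special meaning. The sketch below, which is an addition and not part of the original answer, passes regex=False to match the pattern literally and converts the mask into indices with np.flatnonzero. The extra entry "a.a" is made up to show the difference.

```python
import numpy as np
import pandas as pd

# "a.a" is a made-up extra entry that contains a literal dot
entries = np.array(["aaa", "aab", "aca", "a.a"])
series = pd.Series(entries)

# Default: "a.a" is a regex, so the dot matches any character
regex_mask = series.str.contains("a.a").to_numpy()

# regex=False: "a.a" is matched as a plain substring
literal_mask = series.str.contains("a.a", regex=False).to_numpy()

print(np.flatnonzero(regex_mask))    # [0 2 3]
print(np.flatnonzero(literal_mask))  # [3]
```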


3

The way you are using np.where is incorrect. Its first argument should be a boolean array, but you are passing it a single boolean.

>>> foo in bar
False
>>> np.where(False)
(array([], dtype=int32),)
>>> np.where(np.array([True, True, False]))
(array([0, 1], dtype=int32),)

The problem is that numpy does not define the in operator as an element-wise boolean operation.

One way you could accomplish what you want is with a list comprehension.

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]

bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
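
If an array of indices is preferred over a list, the same comprehension can build the boolean array that np.where expects. A minimal sketch, reusing the second example's data as a NumPy array:

```python
import numpy as np

foo = "aa"
bar = np.array(["aca", "bba", "baa", "aaf", "ccc"])

# One boolean per element: True where foo is a substring
mask = np.array([foo in v for v in bar])

# np.where returns a tuple with one index array per dimension
indices = np.where(mask)[0]

print(indices)    # [2 3]
print(bar[mask])  # ['baa' 'aaf']
```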


1

You can also do something like this:

mask = np.array([foo in x for x in bar])  # True where foo is a substring
matches = bar[np.where(mask)]             # elements of bar that contain foo

1 Comment

Hi and welcome to Stack Overflow! While this answer may solve the problem, it does not try to answer the question of why the original code wasn't working. Could you please edit your answer to explain this too? Thanks!
