Finding a Substring in a Dataframe from a Numpy Array?

Question

How do I check a dataframe column for any of the substrings stored in an array, and use the matching array value to fill a new column? I started off using str.contains and typing out each string value (see below).

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)
  
csv_df['animal'] = np.where(csv_df.item_name.str.contains('Condor'), "Condor",
                   np.where(csv_df.item_name.str.contains('Marmot'), "Marmot",
                   np.where(csv_df.item_name.str.contains('Bear'), "Bear",
                   np.where(csv_df.item_name.str.contains('Pika'), "Pika",
                   np.where(csv_df.item_name.str.contains('Rat'), "Rat",
                   np.where(csv_df.item_name.str.contains('Racoon'), "Racoon",
                   np.where(csv_df.item_name.str.contains('Opossum'), "Opossum",
                            "")))))))  # the innermost np.where needs a fallback value for rows with no match

How would I go about achieving the above code if the string values are in an array instead? Sample below:

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)

animal_list = np.array(['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum'])

I'm not sure that using an array here helps. A list of strings is just as good. Pandas doesn't use numpy string dtypes for its strings; columns with strings are object dtype, holding Python strings. Use string methods and pandas' own string enhancements.
– hpaulj, Commented Oct 5, 2021 at 0:45
Hi and welcome to SO. It would be great if you could have a look at How to Ask and then try to produce a minimal reproducible example. In this case @Jonathan Leon was kind enough to produce an example for you, but you should always try to include sample data in your question.
– rpanai, Commented Oct 5, 2021 at 1:20

rpanai · Accepted Answer · 2021-10-05 12:04:09Z

2

There is a better way than using apply or several nested np.where calls: have a look at np.select. As in the other answer, we are assuming that each row can have only one match; if a row matches several animals, np.select keeps the first one in the list.
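To make the "first match wins" behaviour concrete, here is a minimal sketch of np.select on plain numbers. The variable names and thresholds are made up for illustration and are not part of the original question.

```python
import numpy as np

values = np.array([1, 5, 12, 20])

# Conditions are checked in order; the first True one decides the label.
conditions = [values < 10, values < 15]
choices = ["small", "medium"]

# default is used for elements where no condition is True.
labels = np.select(conditions, choices, default="large")
print(labels.tolist())
```

The values 1 and 5 satisfy both conditions but are labelled by the first one, which is why the order of animal_list matters when names overlap.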

Data

Borrowed from @Jonathan Leon's answer

import pandas as pd
import numpy as np
data = ['Condor', 
        'Marmot',
        'Bear',
        'Condor a',
        'Marmotb',
        'Bearxyz']

df = pd.DataFrame(data, columns=["item_name"])

animal_list = ['Condor', 
               'Marmot',
               'Bear',
               'Pika',
               'Rat',
               'Racoon',
               'Opossum']

Define conditions for numpy select

cond_list = [df["item_name"].str.contains(animal) 
             for animal in animal_list]

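# default="" keeps the result a string array; newer NumPy versions may raise a
# TypeError when the implicit integer default 0 is mixed with string choices
df["animal"] = np.select(cond_list, animal_list, default="")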

Output


  item_name  animal
0    Condor  Condor
1    Marmot  Marmot
2      Bear    Bear
3  Condor a  Condor
4   Marmotb  Marmot
5   Bearxyz    Bear

Case insensitive

For a case-insensitive match, replace the last two statements with

# Lower-case both sides so the comparison ignores case
cond_list = [df["item_name"].str.lower().str.contains(animal.lower())
             for animal in animal_list]

df["animal"] = np.select(cond_list, animal_list, default="")
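As an alternative to building one condition per animal, a single case-insensitive regular expression can do the lookup. This is a different technique from the np.select approach above, sketched under the assumption that the animal names contain no overlapping matches; the sample rows and the helper names pattern and canonical are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"item_name": ["condor a", "MARMOTb", "Bearxyz", "Otter"]})
animal_list = ["Condor", "Marmot", "Bear", "Pika", "Rat", "Racoon", "Opossum"]

# One capture group listing every animal name, escaped for safety.
pattern = "(" + "|".join(re.escape(a) for a in animal_list) + ")"

# Map the lower-cased hit back to the spelling used in animal_list.
canonical = {a.lower(): a for a in animal_list}

found = df["item_name"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df["animal"] = found.str.lower().map(canonical)
print(df)
```

Note that a regex alternation returns the leftmost match in the text, not the first entry of animal_list, so the two approaches can differ when a row mentions more than one animal. Rows without a match are left as missing values.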

edited Oct 5, 2021 at 12:04

answered Oct 5, 2021 at 1:16

rpanai


2 Comments

doubledribble Over a year ago

Thanks @rpanai. How would I go about ignoring case sensitivity?

rpanai Over a year ago

@doubledribble I added the example. Please consider reading the links I already posted, How to Ask and minimal reproducible example. It is better to ask everything at once instead of adding follow-up requests.

Jonathan Leon · Answer · 2021-10-05 01:01:14Z

2

I think there's a cleaner way to write this, but it does what you want. If you are worried about case sensitivity or full-word matching, you'll have to modify this to your needs. Also, you don't need an np.array, just a list.

import io
import pandas as pd

data = '''item_name
Condor
Marmot
Bear
Condor a
Marmotb
Bearxyz
'''
# Raw string avoids the invalid-escape warning for \s
df = pd.read_csv(io.StringIO(data), sep=r' \s+', engine='python')
df

animal_list = ['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum']

def find_matches(x):
    for animal in animal_list:
        if animal in x['item_name']:
            return animal

df.apply(lambda x: find_matches(x), axis=1)

0    Condor
1    Marmot
2      Bear
3    Condor
4    Marmot
5      Bear
dtype: object
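The answer mentions that full-word matching needs a modification. One possible sketch, assuming word boundaries (\b) are an acceptable definition of a "full word", is shown below; find_whole_word is a hypothetical helper, not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"item_name": ["Condor", "Condor a", "Marmotb", "Bearxyz"]})
animal_list = ["Condor", "Marmot", "Bear", "Pika", "Rat", "Racoon", "Opossum"]

def find_whole_word(text):
    # Return the first animal that appears as a standalone word in text.
    for animal in animal_list:
        if re.search(rf"\b{re.escape(animal)}\b", text):
            return animal
    return None  # no whole-word match

df["animal"] = df["item_name"].apply(find_whole_word)
print(df)
```

With this variant "Condor a" still matches, while "Marmotb" and "Bearxyz" do not, because the animal name is only part of a longer word there.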

edited Oct 5, 2021 at 1:01

rpanai


answered Oct 5, 2021 at 0:45

Jonathan Leon


3 Comments

rpanai Over a year ago

@doubledribble if the answer was useful please upvote it.

doubledribble Over a year ago

Sorry, this is close. I still want to keep the item_name column--I just want to add another column that shows the match.

rpanai Over a year ago

@doubledribble as long as you do df["animal"] = df.apply(lambda x: find_matches(x), axis=1) you have an extra column added to your original df.
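A short sketch of that suggestion, using a trimmed-down animal_list and made-up rows, shows that assigning the result of apply adds a column and leaves item_name untouched:

```python
import pandas as pd

df = pd.DataFrame({"item_name": ["Condor", "Marmotb", "Otter"]})
animal_list = ["Condor", "Marmot", "Bear"]

def find_matches(row):
    # Return the first animal whose name occurs in the row's item_name.
    for animal in animal_list:
        if animal in row["item_name"]:
            return animal
    return None  # no match for this row

# Assigning to a new column name keeps the existing columns as they are.
df["animal"] = df.apply(find_matches, axis=1)
print(df)
```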
