
How do I check a DataFrame column for any of a list of substrings held in an array, and use the matching array value to fill a new column? I started off using str.contains and typing out each string value by hand (see below).

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)
  
csv_df['animal'] = np.where(csv_df.item_name.str.contains('Condor'), "Condor",
                   np.where(csv_df.item_name.str.contains('Marmot'), "Marmot",
                   np.where(csv_df.item_name.str.contains('Bear'),"Bear",
                   np.where(csv_df.item_name.str.contains('Pika'),"Pika",
                   np.where(csv_df.item_name.str.contains('Rat'),"Rat",
                   np.where(csv_df.item_name.str.contains('Racoon'),"Racoon",
                   # the innermost np.where needs a fallback value for rows with no match
                   np.where(csv_df.item_name.str.contains('Opossum'), "Opossum", None)))))))

How would I go about achieving the above code if the string values are in an array instead? Sample below:

import pandas as pd
import numpy as np

#Filepath directory
csv_report = filepath

#Creates dataframe of CSV report
csv_df = pd.read_csv(csv_report)

animal_list = np.array(['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum'])
  • I'm not sure that using an array here helps. A list of strings is just as good. Pandas doesn't use numpy string dtypes for its strings; columns with strings are object dtype, holding Python strings. Use string methods and pandas' own string enhancements. Commented Oct 5, 2021 at 0:45
  • Hi, and welcome to SO. It would be great if you could have a look at How to Ask and then try to produce a minimal reproducible example. In this case @Jonathan Leon was kind enough to produce an example for you, but you should always try to include sample data in your question. Commented Oct 5, 2021 at 1:20

2 Answers

There is a better way than using apply or chaining several np.where calls: have a look at np.select. As in the other answer, we assume each row matches at most one animal; if several conditions are true for a row, np.select takes the first one in list order.

Data

Stolen from @Jonathan Leon

import pandas as pd
import numpy as np
data = ['Condor', 
        'Marmot',
        'Bear',
        'Condor a',
        'Marmotb',
        'Bearxyz']

df = pd.DataFrame(data, columns=["item_name"])

animal_list = ['Condor', 
               'Marmot',
               'Bear',
               'Pika',
               'Rat',
               'Racoon',
               'Opossum']

Define the conditions for np.select

cond_list = [df["item_name"].str.contains(animal) 
             for animal in animal_list]

# default="" marks rows with no match (np.select's own default is 0)
df["animal"] = np.select(cond_list, animal_list, default="")

Output

  item_name  animal
0    Condor  Condor
1    Marmot  Marmot
2      Bear    Bear
3  Condor a  Condor
4   Marmotb  Marmot
5   Bearxyz    Bear
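
The assumption of one match per row can be made concrete with a small sketch. The sample rows below ("Bear and Condor", "Fox") and the "no match" label are made up for illustration and are not from the question's data; na=False is added on the assumption that item_name might contain missing values.

```python
import numpy as np
import pandas as pd

animal_list = ["Condor", "Marmot", "Bear"]

# Hypothetical rows: one with two animals, one with a single animal, one with none
df = pd.DataFrame({"item_name": ["Bear and Condor", "Marmotb", "Fox"]})

# One boolean mask per animal; regex=False treats each name as a literal substring,
# na=False makes missing values count as "no match"
cond_list = [df["item_name"].str.contains(animal, regex=False, na=False)
             for animal in animal_list]

# np.select walks the conditions in order and keeps the first one that is True,
# so "Bear and Condor" becomes "Condor" because Condor comes first in animal_list
df["animal"] = np.select(cond_list, animal_list, default="no match")

print(df)
```

If a different animal should win when several match, reordering animal_list is enough, since the order of the conditions decides the result.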

Case insensitive

Replace the last two lines with:

cond_list = [df["item_name"].str.contains(animal, case=False)
             for animal in animal_list]

# default="" marks rows with no match (np.select's own default is 0)
df["animal"] = np.select(cond_list, animal_list, default="")
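
As an alternative that is not part of this answer, the same case-insensitive lookup can probably be done in a single pass with str.extract and one combined regular expression. This is only a sketch: the sample rows and the canonical lookup dict are made up for illustration. Note one behavioural difference from np.select: the regex returns the leftmost match in the string, not the first animal in list order.

```python
import re
import pandas as pd

animal_list = ["Condor", "Marmot", "Bear", "Pika", "Rat", "Racoon", "Opossum"]

# Hypothetical rows with mixed casing and one row that matches nothing
df = pd.DataFrame({"item_name": ["condor a", "MARMOTb", "Bearxyz", "Fox"]})

# Build one pattern like "(Condor|Marmot|...)"; re.escape guards against
# names that happen to contain regex metacharacters
pattern = "(" + "|".join(re.escape(animal) for animal in animal_list) + ")"

# expand=False returns a Series; rows without a match come back as NaN
found = df["item_name"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# The extracted text keeps the casing of item_name, so map it back to the
# spelling used in animal_list
canonical = {animal.lower(): animal for animal in animal_list}
df["animal"] = found.str.lower().map(canonical)

print(df)
```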

2 Comments

Thanks @rpanai. How would I go about ignoring case sensitivity?
@doubledribble I added the example. Please consider reading the links I already posted, How to Ask and minimal reproducible example. It is better to ask everything at once instead of adding follow-up requests.

There is probably a cleaner way to write this, but it does what you want. If you need case-insensitive or whole-word matching, you'll have to adapt it to your needs. Also, you don't need an np.array; a plain list is enough.

import io
import pandas as pd

data = '''item_name
Condor
Marmot
Bear
Condor a
Marmotb
Bearxyz
'''
df = pd.read_csv(io.StringIO(data), sep=r' \s+', engine='python')
df

animal_list = ['Condor', 'Marmot','Bear','Pika','Rat','Racoon','Opossum']

def find_matches(x):
    for animal in animal_list:
        if animal in x['item_name']:
            return animal

df.apply(find_matches, axis=1)

0    Condor
1    Marmot
2      Bear
3    Condor
4    Marmot
5      Bear
dtype: object

3 Comments

@doubledribble if the answer was useful please upvote it.
Sorry, this is close. I still want to keep the item_name column--I just want to add another column that shows the match.
@doubledribble as long as you do df["animal"] = df.apply(lambda x: find_matches(x), axis=1) you have an extra column added to your original df.
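
The point made in the last comment can be sketched as follows. The helper below is a variant of find_matches that takes the string directly instead of the whole row, which is an assumption made to keep the example short; the name find_match is hypothetical.

```python
import pandas as pd

animal_list = ["Condor", "Marmot", "Bear", "Pika", "Rat", "Racoon", "Opossum"]

df = pd.DataFrame({"item_name": ["Condor", "Marmot", "Bear",
                                 "Condor a", "Marmotb", "Bearxyz"]})

def find_match(name):
    """Return the first animal in animal_list that occurs in name, else None."""
    for animal in animal_list:
        if animal in name:
            return animal
    return None

# Assigning the result to a new column leaves item_name untouched
df["animal"] = df["item_name"].apply(find_match)

print(df)
```

Applying the function to the single column rather than to every row with axis=1 avoids building a row object per call, which tends to be faster on larger frames.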
