Find intersection of two columns in Python Pandas -> list of strings

Question

I would like to count, row by row, how many strings the lists in columns A and B have in common. Each cell in column A and column B holds a list of strings. For example, column A may contain [car, passenger, truck] and column B may contain [car, house, flower, truck]. Since 2 strings overlap in this case, column C should display 2.

I have tried (none of these work):

df['unique'] = np.unique(frame[['colA', 'colB']])

or

def unique(colA, colB):
    unique1 = list(set(colA) & set(colB))
    return unique1

df['unique'] = df.apply(unique, args=(df['colA'], frame['colB']))

TypeError: ('unique() takes 2 positional arguments but 3 were given', 'occurred at index article')

Minimal reproducible example with reproducible code sample, please?
– cs95, Commented Apr 12, 2018 at 12:15
What exactly would you like me to add? I used the code above and provided the error.
– Mia, Commented Apr 12, 2018 at 12:18

jezrael · Accepted Answer · 2018-04-12 12:28:46Z

20

I believe you need the length of a set.intersection, computed in a list comprehension:

df['C'] = [len(set(a).intersection(b)) for a, b in zip(df.A, df.B)]

Or:

df['C'] = [len(set(a) & set(b)) for a, b in zip(df.A, df.B)]

Sample:

import pandas as pd

df = pd.DataFrame(data={'A':[['car', 'passenger', 'truck'], ['car', 'truck']],
                        'B':[['car', 'house', 'flower', 'truck'], ['car', 'house']]})
print(df)
                         A                            B
0  [car, passenger, truck]  [car, house, flower, truck]
1             [car, truck]                 [car, house]

df['C'] = [len(set(a).intersection(b)) for a, b in zip(df.A, df.B)]
print(df)
                         A                            B  C
0  [car, passenger, truck]  [car, house, flower, truck]  2
1             [car, truck]                 [car, house]  1
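
As for the TypeError in the question: df.apply calls the function once per column by default (axis=0) and passes that column as the first argument, so the two Series supplied through args arrive as a second and third argument. A minimal sketch of a row-wise version, assuming the columns are named A and B as in the sample above (the helper name count_overlap is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['car', 'passenger', 'truck'], ['car', 'truck']],
    'B': [['car', 'house', 'flower', 'truck'], ['car', 'house']],
})

def count_overlap(row):
    # row is a single row of df (a Series); count_overlap is a hypothetical name
    return len(set(row['A']) & set(row['B']))

# axis=1 makes apply pass one row at a time instead of one column at a time
row_wise = df.apply(count_overlap, axis=1)

# The list comprehension from this answer gives the same counts
list_comp = [len(set(a) & set(b)) for a, b in zip(df.A, df.B)]

print(row_wise.tolist())  # [2, 1]
print(list_comp)          # [2, 1]
```

Both forms give the same counts; the list comprehension avoids building a Series object for every row.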

edited Apr 12, 2018 at 12:28

answered Apr 12, 2018 at 12:18


2 Comments

Alessandro Benedetti Over a year ago

Hi @jezrael, I was exploring your solution and functionally it works, but on big data frames it is not fast enough for my use case. I am new to Pandas, so do you think it is possible to speed this up with some data manipulation? I was thinking of transforming the lists into a Series of Series ( stack_query_time_categorical = only_categorical['A'].apply(pd.Series).stack().astype('category') ), but then I am struggling to calculate the intersections between them for all the values.

Marses Over a year ago

@AlessandroBenedetti One alternative method is based on what you tried to do with df.apply. You could have done: df.apply(lambda x: len(set(x["col1"]) & set(x["col2"])), axis=1). However, I tested it, and this is actually slower than the list comprehension method in this answer (15 ms for the list comprehension vs 370 ms for apply on a long df). The problem is that pandas isn't designed to do vectorised operations on arrays of objects such as lists. You might be hard-pressed to find a faster way to do this.
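
A rough way to reproduce that comparison is sketched below. The data is synthetic (the sample rows from the answer repeated n times, which is an assumption and not the data behind the timings quoted above), and the function names with_list_comp and with_apply are hypothetical. Absolute timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Synthetic data: the answer's two sample rows repeated n times (an assumption)
n = 5000
df = pd.DataFrame({
    'col1': [['car', 'passenger', 'truck'], ['car', 'truck']] * n,
    'col2': [['car', 'house', 'flower', 'truck'], ['car', 'house']] * n,
})

def with_list_comp():
    # Iterate over both columns in parallel, outside of pandas
    return [len(set(a) & set(b)) for a, b in zip(df['col1'], df['col2'])]

def with_apply():
    # Row-wise apply: pandas builds a Series for every row
    return df.apply(lambda x: len(set(x['col1']) & set(x['col2'])), axis=1).tolist()

# Both approaches must agree before the timings mean anything
assert with_list_comp() == with_apply()

t_comp = timeit.timeit(with_list_comp, number=3)
t_apply = timeit.timeit(with_apply, number=3)
print(f"list comprehension: {t_comp:.4f} s, apply: {t_apply:.4f} s")
```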
