
I would like to count how many strings columns A and B have in common in each row. Each row of columns A and B holds a list of strings. For example, column A may contain [car, passenger, truck] and column B may contain [car, house, flower, truck]. Since 2 strings overlap in this case, column C should display 2.

I have tried (none of these work):

df['unique'] = np.unique(frame[['colA', 'colB']])

or

def unique(colA, colB):
    unique1 = list(set(colA) & set(colB))
    return unique1

df['unique'] = df.apply(unique, args=(df['colA'], frame['colB']))

TypeError: ('unique() takes 2 positional arguments but 3 were given', 'occurred at index article')

2 Comments

  • Could you provide a minimal reproducible example with a reproducible code sample, please? Commented Apr 12, 2018 at 12:15
  • What exactly would you like me to add? I used the code above and provided the error. Commented Apr 12, 2018 at 12:18

1 Answer


I believe you need the length of set.intersection in a list comprehension:

df['C'] = [len(set(a).intersection(b)) for a, b in zip(df.A, df.B)]

Or:

df['C'] = [len(set(a) & set(b)) for a, b in zip(df.A, df.B)]

Sample:

import pandas as pd

df = pd.DataFrame(data={'A':[['car', 'passenger', 'truck'], ['car', 'truck']],
                        'B':[['car', 'house', 'flower', 'truck'], ['car', 'house']]})
print(df)
                         A                            B
0  [car, passenger, truck]  [car, house, flower, truck]
1             [car, truck]                 [car, house]

df['C'] = [len(set(a).intersection(b)) for a, b in zip(df.A, df.B)]
print(df)
                         A                            B  C
0  [car, passenger, truck]  [car, house, flower, truck]  2
1             [car, truck]                 [car, house]  1
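
For reference, the TypeError in the question arises because df.apply passes each column (or each row, with axis=1) as the first argument and then appends everything in args, so a two-parameter function receives three arguments. A minimal sketch of a row-wise version of the asker's function is below; the name count_overlap is hypothetical, and the colA/colB column names are taken from the question rather than from the sample above.

```python
import pandas as pd

# Sample data using the column names from the question (colA, colB).
df = pd.DataFrame({
    'colA': [['car', 'passenger', 'truck'], ['car', 'truck']],
    'colB': [['car', 'house', 'flower', 'truck'], ['car', 'house']],
})

def count_overlap(row):
    # With axis=1, `row` is a single DataFrame row (a Series),
    # so both lists can be read from it by column name.
    return len(set(row['colA']) & set(row['colB']))

# axis=1 makes apply call the function once per row; no `args` are needed.
df['C'] = df.apply(count_overlap, axis=1)
print(df['C'].tolist())  # [2, 1]
```

This gives the same counts as the list comprehension; it is shown only to explain the error, not as the faster option.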

2 Comments

Hi @jezrael, I was exploring your solution and functionally it works, but on big data frames it is not fast enough for my use case. I am new to Pandas, so do you think it is possible to speed this up with some data manipulation? I was thinking of transforming the lists into a Series of Series (stack_query_time_categorical = only_categorical['A'].apply(pd.Series).stack().astype('category')), but then I am struggling to calculate the intersections between them for all the values.
@AlessandroBenedetti One alternative method is based on what you tried to do with df.apply. You could have done: df.apply(lambda x: len(set(x["col1"]) & set(x["col2"])), axis=1). However, I tested it, and this is actually slower than the list comprehension method in this answer (370ms for apply vs 15ms for the list comprehension on a long df). The problem is that pandas isn't designed to do vectorised operations on arrays of objects such as lists. You might be hard-pressed to find a faster way to do this.
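
The comparison described in the comment above can be reproduced with a rough timing sketch like the one below. The data size, the repeated sample rows, and the helper names with_list_comp and with_apply are assumptions made for illustration; actual timings will depend on the machine and the data.

```python
import timeit
import pandas as pd

# Hypothetical test data: the same two lists repeated n times.
n = 5000
df = pd.DataFrame({
    'A': [['car', 'passenger', 'truck']] * n,
    'B': [['car', 'house', 'flower', 'truck']] * n,
})

def with_list_comp(frame):
    # List comprehension over the two columns, as in the answer.
    return [len(set(a) & set(b)) for a, b in zip(frame['A'], frame['B'])]

def with_apply(frame):
    # Row-wise apply, as suggested in the comment.
    return frame.apply(lambda row: len(set(row['A']) & set(row['B'])), axis=1).tolist()

# Both approaches should produce identical counts.
assert with_list_comp(df) == with_apply(df)

t_comp = timeit.timeit(lambda: with_list_comp(df), number=3)
t_apply = timeit.timeit(lambda: with_apply(df), number=3)
print(f"list comprehension: {t_comp:.3f}s, apply: {t_apply:.3f}s")
```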
