My dataframe looks like:

School  Term            Students
A       summer 2020     324
B       spring 21       101
A       summer/spring   201
F       wintersem       44
C       fall trimester  98
E                       23

I need to add a new column, Termcode, that takes one of six values (summer, spring, fall, winter, multiple, none) based on the corresponding value in the Term column, viz.:

School  Term            Students  Termcode
A       summer 2020     324       summer
B       spring 21       101       spring
A       summer/spring   201       multiple
F       wintersem       44        winter
C       fall trimester  98        fall
E                       23        none
  • What have you tried so far? Commented Jun 2, 2022 at 11:36

2 Answers

You can use a regex with str.extractall, then fill in the values depending on the number of distinct matches per row:

terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'

# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)

# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')

# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')

output:

  School            Term  Students  Termcode
0      A     summer 2020       324    summer
1      B       spring 21       101    spring
2      A   summer/spring       201  multiple
3      F       wintersem        44    winter
4      C  fall trimester        98      fall
5      E             NaN        23      none
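
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is a reconstruction of the question's sample data (assuming the empty Term for school E is NaN), and the explicit reindex is an addition that only makes the alignment with the original rows visible; the rest follows the steps above.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: the blank Term is NaN)
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})

terms = ["summer", "spring", "fall", "winter"]
regex = "(" + "|".join(terms) + ")"

# One row per match, indexed by (original row, match number)
matches = df["Term"].str.extractall(regex)[0]

# Group the matches back to their original row
g = matches.groupby(level=0)

# First match per row, or "multiple" when a row has more than one distinct term
termcode = g.first().mask(g.nunique().gt(1), "multiple")

# Rows with no match are absent from termcode; restore them and fill with "none"
df["Termcode"] = termcode.reindex(df.index).fillna("none")

print(df)
```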

5 Comments

Worked like clockwork, thanks! Would you suggest any modification to account for another no-match possibility, where the Term value is something like "full year 21" (i.e., no match for any of the 4 terms, but not a missing value either)? The Termcode output should still be none, as in the case of missing values.
I would probably add "full year" or "year" to the list of terms and handle it separately? It depends on what exactly you want.
A row with School / Term / Students of F / full year2021 / 40 should be converted to School / Term / Students / Termcode of F / full year2021 / 40 / none.
Then I believe it should already work
Indeed it does! Thank you again.
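
A quick check of the case discussed in these comments, as a sketch: a non-missing Term that contains none of the four terms produces no rows in the extractall result, so it ends up as "none" just like a missing value. The small Series and the explicit reindex are illustrative additions, not part of the original answer.

```python
import pandas as pd

terms = ["summer", "spring", "fall", "winter"]
regex = "(" + "|".join(terms) + ")"

# Hypothetical test values: a no-match string, a normal match, a missing value
term = pd.Series(["full year 2021", "spring 21", None], name="Term")

g = term.str.extractall(regex)[0].groupby(level=0)
termcode = g.first().mask(g.nunique().gt(1), "multiple")

# Both the no-match row and the missing row are absent here, so both get "none"
termcode = termcode.reindex(term.index).fillna("none")

print(termcode.tolist())  # ['none', 'spring', 'none']
```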

Series.str.findall

import numpy as np

l = ['summer', 'spring', 'fall', 'winter']

# all matches per row; 'multiple' if more than one, else the first match
s = df['Term'].str.findall('|'.join(l))
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])

  School            Term  Students  Termcode
0      A     summer 2020       324    summer
1      B       spring 21       101    spring
2      A   summer/spring       201  multiple
3      F       wintersem        44    winter
4      C  fall trimester        98      fall
5      E             NaN        23       NaN

1 Comment

I had a similar approach in mind initially. Unfortunately, it would count as "multiple" any row with more than one identical match (e.g., "fall/fall trimester"). This might be unlikely, but it's better to be aware of it ;)
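
One possible way around the duplicate-match caveat, as a sketch rather than the answerer's method: count distinct matches instead of all matches. The helper name classify and the sample Series are hypothetical, and mapping missing or unmatched values to "none" is an assumption made to match the output requested in the question.

```python
import numpy as np
import pandas as pd

terms = ["summer", "spring", "fall", "winter"]
pattern = "|".join(terms)

# Hypothetical test values, including a repeated identical match
term = pd.Series(["fall/fall trimester", "summer/spring", "wintersem", np.nan])

# List of all matches per row (NaN where the value is missing)
found = term.str.findall(pattern)

def classify(matches):
    """Map a list of matches to a single term, 'multiple', or 'none'."""
    # Missing values come through as NaN (not a list); no match is an empty list
    if not isinstance(matches, list) or not matches:
        return "none"
    # Compare distinct matches, so "fall/fall" still counts as a single term
    return matches[0] if len(set(matches)) == 1 else "multiple"

termcode = found.apply(classify)

print(termcode.tolist())  # ['fall', 'multiple', 'winter', 'none']
```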
