How can I split Pandas dataframe column with strings according to multiple conditions

Question

I have a pandas DataFrame that looks like this:

    ID         Col.A
    28654      This is a dark chocolate which is sweet
    39876      Sky is blue 1234 Sky is cloudy 3423
    88776      Stars can be seen in the dark sky
    35491      Schools are closed 4568 but shops are open

I want to split Col.A just before the word dark or before the first run of digits. My desired result is given below.

     ID             Col.A                             Col.B
    
    28654      This is a                  dark chocolate which is sweet 
    39876      Sky is blue                1234 Sky is cloudy 3423
    88776      Stars can be seen in the   dark sky
    35491      Schools are closed         4568 but shops are open

I tried putting the rows that contain the word dark into one dataframe and the rows with digits into another, then splitting each accordingly. After that I concatenate the resulting dataframes to obtain the expected result. The code is given below:

import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})

# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ') == True]
# Remaining rows (the ones with digits)
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Splitting on the delimiter itself drops it from the result
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)

The result I get is different from the one I expected:

      0                              1
0   This is a                   chocolate which is sweet
2   Stars can be seen in the     sky    
1   Sky is blue                  Sky is cloudy  
3   Schools are closed           but shops are open

The digits and the word dark are missing from the result.

So how can I solve this and get the result without losing the word or the digits that I split on?

Is there any way to "slice before expected word or digits" without removing them?

Shubham Sharma · Accepted Answer · 2021-04-27 17:39:58Z

7

`Series.str.split`

s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open

Regex details:

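\s+ : Matches one or more whitespace characters
(?=\b(?:dark|\d+)\b) : Positive lookahead; asserts that the text that follows matches, without consuming it, so the matched word or digits stay in the second part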
- \b : Word boundary to prevent partial matches
- (?:dark|\d+): Non capturing group
  - dark : First Alternative matches the characters dark literally
  - \d+ : Second alternative which matches any digit one or more times
- \b : Word boundary to prevent partial matches
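
To see the lookahead at work outside pandas, here is a minimal sketch using the standard library `re` module on the sample strings from the question. It assumes that `maxsplit=1` in `re.split` plays the same role as `n=1` in `Series.str.split`; the names `PATTERN`, `samples` and `parts` are only for illustration.

```python
import re

# Pattern from the answer: whitespace, followed (lookahead) by the whole word
# "dark" or a whole run of digits. The lookahead itself consumes nothing.
PATTERN = re.compile(r'\s+(?=\b(?:dark|\d+)\b)')

samples = [
    'This is a dark chocolate which is sweet',
    'Sky is blue 1234 Sky is cloudy 3423',
    'Stars can be seen in the dark sky',
    'Schools are closed 4568 but shops are open',
]

# Split each string once, at the first whitespace that precedes "dark" or digits
parts = [PATTERN.split(text, maxsplit=1) for text in samples]

for left, right in parts:
    print(repr(left), '|', repr(right))
```

Only the whitespace is removed by the split, so `dark` and the digits remain at the start of the second part.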


edited Apr 27, 2021 at 17:39

answered Apr 27, 2021 at 17:21

Shubham Sharma


2 Comments

Athul R T Over a year ago

That's cool. What if I have dark and darkest in the same line and I need to split before dark only? Is there any way for it?

Shubham Sharma Over a year ago

@AthulRT Yes we could do that. I've edited the answer.
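
For the dark versus darkest case raised in this comment, a small sketch with the standard library `re` module shows what the `\b` word boundaries change. The sample sentence is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample containing both "darkest" and "dark"
text = 'The darkest night has a dark side'

# With word boundaries: "darkest" is skipped, the split happens before "dark"
with_boundary = re.split(r'\s+(?=\b(?:dark|\d+)\b)', text, maxsplit=1)

# Without word boundaries: "dark" also matches the start of "darkest"
without_boundary = re.split(r'\s+(?=(?:dark|\d+))', text, maxsplit=1)

print(with_boundary)
print(without_boundary)
```

The trailing `\b` also means a digit run glued to letters (for example `1234abc`) is not treated as a split point.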

RavinderSingh13 · Accepted Answer · 2021-04-27 17:48:40Z

4

With your shown samples, please try the following, which uses the pandas str.extract function. The regex has two capturing groups: the first is a non-greedy match of everything up to the split point, and the second starts at the digits or the string dark and runs to the end of the line. The two groups are saved into the Col.A and Col.B columns.

df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df

With the shown samples, the output is as follows:

    ID      Col.A                       Col.B
0   28654   This is a                   dark chocolate which is sweet
1   39876   Sky is blue                 1234 Sky is cloudy 3423
2   88776   Stars can be seen in the    dark sky
3   35491   Schools are closed          4568 but shops are open
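
One detail that the printed table hides: with this pattern the first group keeps the space in front of dark or the digits. The sketch below checks that on a single sample string with the standard library `re` module. The second pattern is a hypothetical variant, not part of the answer, that moves the whitespace out of the first group and adds word boundaries.

```python
import re

text = 'This is a dark chocolate which is sweet'

# Pattern from the answer: lazy prefix, then "dark" or digits up to the end
m = re.search(r'(.*?)((?:dark|\d+).*)', text)
col_a, col_b = m.group(1), m.group(2)

# Hypothetical variant: \s* consumes the whitespace between the two groups,
# and \b keeps longer words such as "darkest" from matching
m2 = re.search(r'(.*?)\s*\b((?:dark|\d+)\b.*)', text)
col_a2, col_b2 = m2.group(1), m2.group(2)

print(repr(col_a), repr(col_b))
print(repr(col_a2), repr(col_b2))
```

If the trailing space matters downstream, applying `.str.strip()` to Col.A after the extract is another option.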

edited Apr 27, 2021 at 17:48

answered Apr 27, 2021 at 17:42

RavinderSingh13


Andrej Kesely · Accepted Answer · 2021-04-27 17:21:24Z

3

df[["Col.A", "Col.B"]] = df["Col.A"].str.split(
    r"\s*(dark.*|\d.*)", n=1, expand=True
)[[0, 1]]
print(df)

Prints:

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open
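
This answer comes without an explanation, so here is a rough sketch of why it works, using the standard library `re` module: when the split pattern contains a capturing group, the text matched by that group is returned as part of the result instead of being dropped. Because the group runs to the end of the string, an empty third piece is left over, which would explain the `[[0, 1]]` selection in the answer.

```python
import re

samples = [
    'This is a dark chocolate which is sweet',
    'Sky is blue 1234 Sky is cloudy 3423',
]

# The capturing group (dark.*|\d.*) is kept in the output of re.split;
# the leading \s* sits outside the group, so that whitespace is discarded
parts = [re.split(r'\s*(dark.*|\d.*)', text, maxsplit=1) for text in samples]

for p in parts:
    print(p)
```

Note that this pattern has no word boundaries, so it would also split before a word such as darkest.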

answered Apr 27, 2021 at 17:21

Andrej Kesely
