I have a pandas DataFrame that looks like this:
ID     Col.A
28654  This is a dark chocolate which is sweet
39876  Sky is blue 1234 Sky is cloudy 3423
88776  Stars can be seen in the dark sky
35491  Schools are closed 4568 but shops are open
I want to split Col.A just before the word dark or before the digits, keeping them at the start of the second part. My desired result is given below.
ID     Col.A                     Col.B
28654  This is a                 dark chocolate which is sweet
39876  Sky is blue               1234 Sky is cloudy 3423
88776  Stars can be seen in the  dark sky
35491  Schools are closed        4568 but shops are open
My approach was to put the rows that contain the word dark into one dataframe and the rows with digits into another, split each one accordingly, and then concatenate the two results to get the expected output. The code is given below:
import pandas as pd

df = pd.DataFrame({'ID':[28654,39876,88776,35491], 'Col.A':['This is a dark chocolate which is sweet',
'Sky is blue 1234 Sky is cloudy 3423',
'Stars can be seen in the dark sky',
'Schools are closed 4568 but shops are open']})
# rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')==True]
# remaining rows (the ones with digits)
df2 = df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
# str.split drops the separator it splits on
df1 = df1["Col.A"].str.split(' dark ', expand = True)
df2 = df2["Col.A"].str.split(r'\d+', expand = True)
pd.concat([df1, df2], axis =0)
The result I get is different from the expected one:
   0                         1
0  This is a                 chocolate which is sweet
2  Stars can be seen in the  sky
1  Sky is blue               Sky is cloudy
3  Schools are closed        but shops are open
The digits and the word dark are missing from the result.
How can I solve this and get the result without losing the splitting word and the digits?
Is there a way to "slice before the expected word or digits" without removing them?
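To make the goal concrete, here is a minimal sketch of the kind of "slice before" behaviour described above. It assumes a regex lookahead is acceptable, that only the first occurrence of dark or of a digit run should trigger the split, and that the delimiter to drop is the whitespace just before it. The names SPLIT_AT and split_before_keyword are hypothetical helpers made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'ID': [28654, 39876, 88776, 35491],
    'Col.A': ['This is a dark chocolate which is sweet',
              'Sky is blue 1234 Sky is cloudy 3423',
              'Stars can be seen in the dark sky',
              'Schools are closed 4568 but shops are open'],
})

# Hypothetical pattern: match the whitespace that is followed by the whole
# word "dark" or by a digit. The lookahead (?=...) consumes nothing, so the
# word or the digits stay at the start of the second piece.
SPLIT_AT = re.compile(r'\s+(?=dark\b|\d)')

def split_before_keyword(text):
    """Split once before 'dark' or digits; second part is '' if no match."""
    parts = SPLIT_AT.split(text, maxsplit=1)
    if len(parts) == 1:
        parts.append('')
    return parts

pairs = [split_before_keyword(s) for s in df['Col.A']]
result = df[['ID']].copy()
result['Col.A'] = [first for first, _ in pairs]
result['Col.B'] = [second for _, second in pairs]
print(result)
```

Because the lookahead only checks what follows the whitespace, nothing except that whitespace is discarded. Passing the same pattern to df['Col.A'].str.split(..., n=1, expand=True) would presumably give the same two columns, but I have not verified that on every pandas version.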