Split DataFrame column based on regex expression with OR

Question

I have a dataframe with several columns of user information where I have the columns "Contact 1" and "Contact 2".

import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891', '1234567891 12345678','1234567 1234567891',
          '1234567891','123456789 12345678911', None],
    'Contact 2': [None, None, None, None, None, '12345678', None, None]}

df = pd.DataFrame(data=d)

Contact 1	Contact 2
1234567891 1234567891	None
12345678 12345678	None
12345678 1234567891	None
1234567891 12345678	None
1234567 1234567891	None
1234567891	12345678
123456789 12345678911	None
None	None

I want to split the "Contact 1" column on the space between the numbers, but only when the value is an 8- or 10-digit number followed by a space and then another 8- or 10-digit number. I also want to preserve the little information I already have in the "Contact 2" column.

I tried the following code:


df[['Contact 1', 'Contact 2']]=df['Contact 1'].str.split(r'(?<=^\d{8}|\d{10})\s(?=\d{8}|\d{10}$)', n=1, expand=True)

but I get the error "re.error: look-behind requires fixed-width pattern"

I would like to get the following result:

Contact 1	Contact 2
1234567891	1234567891
12345678	12345678
12345678	1234567891
1234567891	12345678
1234567 1234567891	None
1234567891	12345678
123456789 12345678911	None
None	None
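
A note on the error, as a minimal sketch using only the standard `re` module: Python requires every alternative inside a look-behind to have the same width, and `^\d{8}|\d{10}` mixes 8 and 10 characters. One possible workaround is a separate fixed-width look-behind per length. The names `bad`, `good`, `samples` and `results` are illustrative only. The sketch also groups the alternation in the look-ahead so that `$` applies to both lengths, which the original pattern does not do.

```python
import re

# Pattern from the question: the look-behind alternatives are 8 and 10
# characters wide, so `re` refuses to compile it.
bad = r'(?<=^\d{8}|\d{10})\s(?=\d{8}|\d{10}$)'
try:
    re.compile(bad)
    compiled = True
except re.error:
    compiled = False

# One fixed-width look-behind per length, joined by an outer alternation.
# Look-aheads may be variable-width, so the alternation can stay inside.
good = re.compile(r'(?:(?<=^\d{8})|(?<=^\d{10}))\s(?=(?:\d{8}|\d{10})$)')

samples = ['12345678 1234567891', '1234567 1234567891', '123456789 12345678911']
results = [good.split(s, maxsplit=1) for s in samples]
```

Passing such a pattern to `str.split(..., expand=True)` and assigning the result to both columns would still replace existing "Contact 2" values in rows that are not split, so the result would need to be merged with the original column.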

Could you please explain how the row 12345678 1234567891 got a value in its Contact 2 column after processing?
– RavinderSingh13, Commented Jun 9, 2021 at 4:17
Yes. Since 12345678 1234567891 consists of an 8- or 10-digit number (in this case 8), followed by whitespace and then another 8- or 10-digit number (in this case 10), the second number should be split off into the 'Contact 2' column.
– user16170404, Commented Jun 9, 2021 at 6:37

Tim Biegeleisen · Accepted Answer · 2021-06-09 05:38:51Z

2

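Using str.extract (with expand=False, so that it returns a Series that lines up with the other np.where arguments):

import numpy as np

# \d{8}|\d{10} rather than \d{8,10}, which would also accept 9 digits
df["Contact 2"] = np.where(df["Contact 2"].isnull(),
                           df["Contact 1"].str.extract(r'^(?:\d{8}|\d{10}) (\d{8}|\d{10})$', expand=False),
                           df["Contact 2"])

Also we need to update the first column. This has to run after the step above, which reads the unsplit values:

df["Contact 1"] = df["Contact 1"].str.replace(r'^(\d{8}|\d{10}) (?:\d{8}|\d{10})$', r'\1', regex=True)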
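
As a hedged variation on the same idea (not the code from this answer), both halves can be captured in one `str.extract` call and merged back with `fillna`. This sketch uses the sample data from the question. The name `parts` is illustrative, and missing values come out as NaN rather than None.

```python
import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891',
                   '1234567891 12345678', '1234567 1234567891', '1234567891',
                   '123456789 12345678911', None],
     'Contact 2': [None, None, None, None, None, '12345678', None, None]}
df = pd.DataFrame(data=d)

# Two capture groups: rows that do not match the whole pattern
# produce a missing value in both result columns (labelled 0 and 1).
parts = df['Contact 1'].str.extract(r'^(\d{8}|\d{10}) (\d{8}|\d{10})$')

# Use the first number where the pattern matched, else keep the original value.
df['Contact 1'] = parts[0].fillna(df['Contact 1'])
# Fill "Contact 2" only where it is missing, so existing values survive.
df['Contact 2'] = df['Contact 2'].fillna(parts[1])

print(df)
```

Because the extraction happens once, before either column is modified, the order of the two assignments does not matter here.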

edited Jun 9, 2021 at 5:38

answered Jun 9, 2021 at 3:57


6 Comments

Chris Over a year ago

This doesn't produce the desired output.

Tim Biegeleisen Over a year ago

@Chris The answer now uses np.where to assign a value to Contact 2 only when it is currently empty.

Chris Over a year ago

Still need to fix the first column!

user16170404 Over a year ago

Thank you! I just tried it and it works for every row except the one that already has a "Contact 2" value, which gets overwritten with NaN. I don't know why that is happening, since the logic seems fine. I'm sorry, I forgot to mention that my real dataframe has None values instead of blanks, so I changed the condition to check for None, as shown below

df["Contact 2"] =np.where(df["Contact 2"] == None, df["Contact 1"].str.extract(r'^\d{8,10} (\d{8,10})$'),df["Contact 2"])

but it doesn't work.

Tim Biegeleisen Over a year ago

@user16170404 Use isnull() on the column to check for None.
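
A small sketch of the difference, assuming an object-dtype column like the one in the question (the Series `s` is made up for illustration): comparing with `== None` is an elementwise comparison that comes out False even for the None entries, whereas `isnull()` flags them.

```python
import pandas as pd

s = pd.Series([None, '12345678', None], dtype=object)

# Elementwise comparison against None: no entry compares equal.
eq_none = (s == None).tolist()

# isnull() detects None and NaN alike.
is_null = s.isnull().tolist()

print(eq_none, is_null)
```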


Chris · Accepted Answer · 2021-06-09 04:47:45Z

0

If you are interested in a non-regex solution:

Create a mask of rows that meet your conditions

m = df['Contact 1'].str.split().apply(lambda x: all([len(n) in [8,10] for n in x]))

Update df with the split/expanded values

df.update(df.loc[m]['Contact 1'].str.split(expand=True).rename(columns={0:'Contact 1',
                                                                        1:'Contact 2'}), overwrite=True)

answered Jun 9, 2021 at 4:47


2 Comments

user16170404 Over a year ago

Thank you! This worked perfectly for my small dataframe, but when I try to run your first line of code on my real dataframe of 75,000 rows I get the error "TypeError: 'NoneType' object is not iterable" which is weird since we just proved with the small dataframe that it works with None values.

user16170404 Over a year ago

I wrote the exception for None as follows m = df['Contact 1'].str.split().apply(lambda x: all([len(n) in [8,10] for n in x]) if x != None else False) and it works now :D Thank you!!
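
A slightly more defensive version of that mask, as a sketch: checking `isinstance(..., list)` covers NaN as well as None, since a missing entry may be either after `str.split()`. The helper name `both_valid` and the sample Series `s` are made up for illustration.

```python
import pandas as pd

s = pd.Series(['12345678 1234567891', '1234567 1234567891', None,
               float('nan'), '1234567891'], dtype=object)

def both_valid(parts):
    # Missing entries stay None/NaN after str.split() instead of becoming a list.
    if not isinstance(parts, list):
        return False
    # Every whitespace-separated piece must be 8 or 10 characters long.
    return all(len(p) in (8, 10) for p in parts)

m = s.str.split().apply(both_valid)
print(m.tolist())
```

As with the original mask, a row holding a single valid number also passes. That should be harmless in the update step, since `DataFrame.update` only uses non-NA values from the other frame.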
