
I have a dataframe with several columns of user information where I have the columns "Contact 1" and "Contact 2".

import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891',
                   '1234567891 12345678', '1234567 1234567891',
                   '1234567891', '123456789 12345678911', None],
     'Contact 2': [None, None, None, None, None, '12345678', None, None]}

df = pd.DataFrame(data=d)

Contact 1               Contact 2
1234567891 1234567891   None
12345678 12345678       None
12345678 1234567891     None
1234567891 12345678     None
1234567 1234567891      None
1234567891              12345678
123456789 12345678911   None
None                    None

I want to split the "Contact 1" column on the space between the numbers, but only when the value is an 8- or 10-digit number, followed by a space, followed by another 8- or 10-digit number. I also want to preserve the little information I already have in the "Contact 2" column.

I tried the following code:


df[['Contact 1', 'Contact 2']]=df['Contact 1'].str.split(r'(?<=^\d{8}|\d{10})\s(?=\d{8}|\d{10}$)', n=1, expand=True)

but I get the error "re.error: look-behind requires fixed-width pattern"

I would like to get the following result:

Contact 1               Contact 2
1234567891              1234567891
12345678                12345678
12345678                1234567891
1234567891              12345678
1234567 1234567891      None
1234567891              12345678
123456789 12345678911   None
None                    None
  • Could you please explain how the row 12345678 1234567891 got a value in its Contact 2 column after processing? Commented Jun 9, 2021 at 4:17
  • Yes. Since 12345678 1234567891 consists of an 8- or 10-digit number (in this case 8), followed by a whitespace, and then another 8- or 10-digit number (in this case 10), the second number should be split off into the 'Contact 2' column. Commented Jun 9, 2021 at 6:37

2 Answers

Using str.extract:

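import numpy as np

# expand=False makes str.extract return a Series, so np.where receives three aligned 1-D inputs
df["Contact 2"] = np.where(df["Contact 2"].isnull(),
                           df["Contact 1"].str.extract(r'^\d{8,10} (\d{8,10})$', expand=False),
                           df["Contact 2"])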

Also we need to update the first column:

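# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df["Contact 1"] = df["Contact 1"].str.replace(r'^(\d{8,10}) \d{8,10}$', r'\1', regex=True)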
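
Note that `\d{8,10}` also accepts 9-digit numbers, which the question rules out. The look-behind error in the question comes from Python's `re` module requiring a look-behind to match a fixed width, so alternatives of 8 and 10 digits are rejected there. A minimal sketch of a stricter variant follows. It captures both numbers instead of splitting, assumes the sample frame from the question, and uses the illustrative names `pattern`, `parts` and `matched`, which are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Contact 1': ['1234567891 1234567891', '12345678 12345678',
                  '12345678 1234567891', '1234567891 12345678',
                  '1234567 1234567891', '1234567891',
                  '123456789 12345678911', None],
    'Contact 2': [None, None, None, None, None, '12345678', None, None],
})

# Exactly 8 or 10 digits, one whitespace character, exactly 8 or 10 digits.
pattern = r'^(\d{8}|\d{10})\s(\d{8}|\d{10})$'

# Two capture groups give a two-column frame; non-matching rows are all-NaN.
parts = df['Contact 1'].str.extract(pattern)
matched = parts[0].notna()

# Only rows that matched are written, so an existing "Contact 2" is left alone.
df.loc[matched, 'Contact 1'] = parts.loc[matched, 0]
df.loc[matched, 'Contact 2'] = parts.loc[matched, 1]
```

Because only the matched rows are assigned, no `np.where` is needed to protect the row that already has a "Contact 2" value.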

6 Comments

This doesn't produce the desired output.
@Chris The answer now uses np.where to assign a value to Contact 2 only when it does not already have one.
Still need to fix the first column!
Thank you! I just tried it and it works for every row except the one that already has a "Contact 2" value, which gets overwritten with nan. I don't know why that is happening, since the logic seems fine. I'm sorry, I forgot to mention that my real dataframe has None values instead of blanks. So I changed the condition to None, as shown here: df["Contact 2"] =np.where(df["Contact 2"] == None, df["Contact 1"].str.extract(r'^\d{8,10} (\d{8,10})$'),df["Contact 2"]) but it doesn't work.
@user16170404 Use isnull() on the column to check for None.
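
On the `== None` attempt in the comments above: an element-wise `==` against `None` does not flag missing entries in an object column, whereas `isnull()` does. A small sketch with a made-up Series (the names `s`, `eq_none` and `is_missing` are illustrative only):

```python
import pandas as pd

# Made-up object column with one real value and two missing ones.
s = pd.Series(['12345678', None, None], dtype=object)

# Element-wise equality with None: missing entries do not compare equal.
eq_none = (s == None).tolist()  # noqa: E711 (deliberate, to show the pitfall)

# isnull() (an alias of isna()) is the supported way to detect None/NaN.
is_missing = s.isnull().tolist()
```

With a condition built from `== None`, `np.where` never sees a true value and so always takes the fallback branch.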

If you are interested in a non-regex solution:

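Create a mask of rows that meet your conditions (rows where the split result is not a list, such as None/NaN, are treated as non-matching)

m = df['Contact 1'].str.split().apply(lambda x: isinstance(x, list) and all(len(n) in (8, 10) for n in x))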

Update df with the split/expanded values

df.update(df.loc[m, 'Contact 1'].str.split(expand=True).rename(columns={0: 'Contact 1',
                                                                        1: 'Contact 2'}), overwrite=True)
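
The reason the existing "Contact 2" value survives is that `DataFrame.update` only copies non-NA values from the other frame. A small sketch with two made-up frames (`left` and `right` are hypothetical names, not from the answer):

```python
import pandas as pd

# Row 0 already has a second contact; row 1 still holds two numbers in one cell.
left = pd.DataFrame({'Contact 1': ['1234567891', '12345678 12345678'],
                     'Contact 2': ['12345678', None]})

# Replacement values; None marks "nothing to write for this cell".
right = pd.DataFrame({'Contact 1': ['1234567891', '12345678'],
                      'Contact 2': [None, '12345678']})

# update() aligns on index and columns and skips NA cells in `right`,
# so left's existing 'Contact 2' in row 0 is kept.
left.update(right)
```

This is why a row holding a single valid number can pass the mask without losing what is already stored in "Contact 2".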

2 Comments

Thank you! This worked perfectly for my small dataframe, but when I try to run your first line of code on my real dataframe of 75,000 rows I get the error "TypeError: 'NoneType' object is not iterable" which is weird since we just proved with the small dataframe that it works with None values.
I wrote the exception for None as follows m = df['Contact 1'].str.split().apply(lambda x: all([len(n) in [8,10] for n in x]) if x != None else False) and it works now :D Thank you!!
