I have two address columns and I want to extract the last word from the first column and the first word from the second column. In the example below no 'Address2' value has two words, but I want to write the code so that it works regardless of what the dataset looks like. Sometimes Address2 will be one word, sometimes it will have two, etc.

import pandas as pd

data = {
    'Address1': ['3 Steel Street', '1 Arnprior Crescent', '40 Bargeddie Street Blackhill'],
    'Address2': ['Saltmarket', 'Castlemilk', 'Blackhill']
}

df = pd.DataFrame(data)

I have no problem with column 'Address1':

df[['StringStart', 'LastWord']] = df['Address1'].str.rsplit(' ', n=1, expand=True)

The problem comes with column 'Address2': if I apply the above code to it, I get an error: Columns must be same length as key

I understand where the problem is coming from - I am trying to split one column which has one element into two columns. I am sure there is a way in which this can be handled to allow the split anyway and return Null if there isn't a word and a value if there is.
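For illustration, here is a minimal sketch of why the assignment fails. The `mixed` series is hypothetical data (not part of my dataset) added only to show the contrast: the number of columns returned by `expand=True` depends on the longest row.

```python
import pandas as pd

addresses = pd.Series(['Saltmarket', 'Castlemilk', 'Blackhill'])

# No row contains a space, so expand=True yields a single column (0).
# Assigning that one column to two new column names raises the error.
one_col = addresses.str.split(' ', n=1, expand=True)
print(one_col.shape)  # (3, 1)

# Hypothetical data: as soon as one row has two words, a second column
# appears and the shorter rows are padded with a missing value.
mixed = pd.Series(['Saltmarket', 'Castlemilk East', 'Blackhill'])
two_col = mixed.str.split(' ', n=1, expand=True)
print(two_col.shape)  # (3, 2)
```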

3 Answers

3

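Using str.extract() might be better for several reasons: it always returns one column per capture group, so the assignment works whether Address2 has one word or several; the regular expression states precisely what to capture; and it avoids the "Columns must be same length as key" ValueError.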

import pandas as pd

data = {
    'Address1': ['3 Steel Street', '1 Arnprior Crescent', '40 Bargeddie Street Blackhill'],
    'Address2': ['Saltmarket', 'Castlemilk East', 'Blackhill']
}
df = pd.DataFrame(data)

df[['StringStart', 'LastWord']] = df['Address1'].str.rsplit(' ', n=1, expand=True)

df[['FirstWord_Address2', 'Remaining_Address2']] = (
    df['Address2'].str.extract(r'^(\S+)\s*(.*)$')
)

print(df)

Or:

df[['Address1_Prefix', 'Address1_LastWord']] = df['Address1'].str.extract(r'^(.*\b)\s+(\S+)$')

df[['Address2_FirstWord', 'Address2_Remaining']] = df['Address2'].str.extract(r'^(\S+)\s*(.*)$')

Output:

                        Address1         Address2          StringStart   LastWord FirstWord_Address2 Remaining_Address2
0                 3 Steel Street       Saltmarket              3 Steel     Street         Saltmarket
1            1 Arnprior Crescent  Castlemilk East           1 Arnprior   Crescent         Castlemilk               East
2  40 Bargeddie Street Blackhill        Blackhill  40 Bargeddie Street  Blackhill          Blackhill

4 Comments

Hi, thanks for this - it worked! Could you please explain what this bit does: (r'^(\S+)\s*(.*)$')? These are regular expressions, but what do we actually say in the brackets? Also, where do we specify that we want the first word in the column? I want to understand this so that I can handle it in the future.
^(\S+) = Grabs the first word (e.g., "Saltmarket" from "Saltmarket", or "Castlemilk" from "Castlemilk East"). \s*(.*) = Matches the whitespace (if any), then captures the remaining text (e.g., "East" in "Castlemilk East"). If no text follows, it captures an empty string. This is saved in the Remaining column. Please don't hesitate to ask if anything is unclear. Thank you. @MariaT
Hi, thanks for this. I am also wondering - if the ^ symbol marks the start of the string, how are we not extracting the first word from column Address1 here: df['Address1'].str.extract(r'^(.*\b)\s+(\S+)$')
^ matches the start of "3 Steel Street". "str.extract(r'^(.*\b)\s+(\S+)$')" is not omitting the first word from the string — it’s actually trying to split the string into two parts. Group 1: (.*\b) – everything up to the last word boundary before the last word. Group 2: (\S+) – the last word (non-whitespace characters at the end). Thank you. @MariaT
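The two patterns discussed in these comments can be checked on a few sample strings. This is a rough sketch: the sample values are taken from the answer's data, while the `no_match` and `remainder` names are introduced here for illustration only.

```python
import pandas as pd

s1 = pd.Series(['3 Steel Street', '40 Bargeddie Street Blackhill'])
s2 = pd.Series(['Saltmarket', 'Castlemilk East'])

# Pattern for Address1: group 1 is everything before the last word,
# group 2 is the last run of non-whitespace characters.
last = s1.str.extract(r'^(.*\b)\s+(\S+)$')
print(last)   # row 0: '3 Steel' / 'Street'

# Pattern for Address2: group 1 is the first word, group 2 the rest
# (an empty string when nothing follows the first word).
first = s2.str.extract(r'^(\S+)\s*(.*)$')
print(first)  # row 0: 'Saltmarket' / ''

# The first pattern requires at least one whitespace character, so a
# one-word value does not match and both groups come back as NaN.
no_match = pd.Series(['Blackhill']).str.extract(r'^(.*\b)\s+(\S+)$')

# If a missing value is preferred over an empty string, mask it out.
remainder = first[1].mask(first[1] == '')
```

Note that the second pattern returns an empty string rather than NaN for one-word values; the final `mask` step is one possible way to convert it if a null is wanted.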
2

I would use df.apply() with a custom function.

This is a straightforward example.

import numpy as np
from functools import partial

def split_addresses(row, col):
    r = row[col].split(' ')
    if len(r) < 2:
        first_word = " ".join(r)
        last_word = np.nan
    else:
        first_word = " ".join(r[:-1])
        last_word = r[-1]
    return first_word, last_word

_fun = partial(split_addresses, col='Address2')  # choose which column you want to process

splits = df.apply(_fun, axis=1)
# Assign plain lists so the values line up by position, whatever the index of df is
df["StringStart"] = [s[0] for s in splits]
df["StringEnd"] = [s[1] for s in splits]

print(df)

                        Address1    Address2 StringStart   LastWord  StringEnd
0                 3 Steel Street  Saltmarket  Saltmarket     Street        NaN
1            1 Arnprior Crescent  Castlemilk  Castlemilk   Crescent        NaN
2  40 Bargeddie Street Blackhill   Blackhill   Blackhill  Blackhill        NaN

1

TL;DR

You can use .reindex to add missing columns:

import pandas as pd

(
    pd.Series(['Hello', 'world'])
      .str.split(n=1, expand=True)
      .reindex(pd.RangeIndex(2), axis=1)
)
       0   1
0  Hello NaN
1  world NaN

With expand=True, both Series.str.split and Series.str.rsplit return a pd.DataFrame whose columns are a default pd.RangeIndex. Hence, with n=1, the result has either one column (0) or two (0 and 1). In general it has at most n+1 columns, i.e. pd.RangeIndex(n+1).

Realizing this, you can use df.reindex with axis=1 to ensure a consistent number of output columns. Missing columns get added with NaN values. Here's a wrapper:

def split_expand(series, n=1, rsplit=False):
    splitter = series.str.rsplit if rsplit else series.str.split
    result = splitter(n=n, expand=True)
    if result.shape[1] < n+1:
        return result.reindex(pd.RangeIndex(n+1), axis=1)
    return result

df[['StringStart', 'LastWord']] = split_expand(df['Address1'], rsplit=True)
df[['FirstWord', 'StringEnd']] = split_expand(df['Address2'])

Output:

                        Address1    Address2          StringStart   LastWord  \
0                 3 Steel Street  Saltmarket              3 Steel     Street   
1            1 Arnprior Crescent  Castlemilk           1 Arnprior   Crescent   
2  40 Bargeddie Street Blackhill   Blackhill  40 Bargeddie Street  Blackhill   

    FirstWord  StringEnd  
0  Saltmarket        NaN  
1  Castlemilk        NaN  
2   Blackhill        NaN   
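As a quick check that the same approach also copes with rows that do have a second word, here is a small self-contained sketch. It restates the wrapper in a slightly shortened form (it always reindexes, which is a no-op when the columns are already there), and the `mixed` series is hypothetical data, not the question's.

```python
import pandas as pd

def split_expand(series, n=1, rsplit=False):
    # Same idea as the wrapper above: pad the result out to n + 1 columns.
    splitter = series.str.rsplit if rsplit else series.str.split
    result = splitter(n=n, expand=True)
    return result.reindex(pd.RangeIndex(n + 1), axis=1)

one_word = pd.Series(['Saltmarket', 'Castlemilk', 'Blackhill'])
mixed = pd.Series(['Saltmarket', 'Castlemilk East', 'Blackhill'])  # hypothetical data

padded = split_expand(one_word)   # second column added, all missing
split = split_expand(mixed)       # second column already present

print(padded.shape, split.shape)  # (3, 2) (3, 2)
```

Either way the result has exactly two columns, so assigning it to two column names works for both inputs.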

1 Comment

Hi, thanks for providing a different solution with such a detailed explanation!
