Extract data from a pandas dataframe strings column and generate new columns based on content in it

Question

I have a pandas column whose values look like this:

**Title **: New_ind

**Body **: Detection_error

**respo_status **: {color}

import pandas as pd

data = {'sl no': [661, 662],
        'key': ['3484', '3483'],
        'id': [13592349, 13592490],
        'Sum': ['[E-1]', '[E-1]'],
        'Desc': [
              "**Title **: New_ind\n\n**Body **: Detection_error\n\n*respo_URL **: www.github.com\n\n**respo_status **: {yellow}",
              "**Title **: New_ind2\n\n**Body **: import_error\n\n*respo_URL **: \n\n**respo_status **: {green}"]}

df = pd.DataFrame(data)

I need to generate new columns where Title, Body, respo_URL, etc. are the column names and everything after the colon is the value in the corresponding cell. Note that the items in the column are plain strings, not dictionaries.

@Timus They are actually not dictionaries, which is why I am having a problem.
– Ashish, Commented Jan 30, 2023 at 16:18
@Timus I have added two rows of the data here. Please let me know if this is sufficient. Thanks.
– Ashish, Commented Jan 31, 2023 at 4:11

Timus · Accepted Answer · 2023-01-31 10:20:41Z

There are various ways to do that with regex but I found this with str-methods to be the clearest:

# Split Desc into one column per "label: value" entry.
desc_df = df["Desc"].str.split("\n\n", expand=True)
for col in desc_df.columns:
    # Split at the first colon only, so values that contain ":" stay intact.
    desc_df[col] = desc_df[col].str.split(":", n=1).str[1].str.strip()
colnames = "Title", "Body", "respo_URL", "respo_status"
desc_df = desc_df.rename(columns=dict(enumerate(colnames)))
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)

First split column Desc at \n\n and expand the result into a dataframe desc_df.
Then split each new column at :, take the right side, and strip whitespace.
Finally change the column names and concat the initial dataframe without the Desc column and desc_df.
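
One caveat about the split at : is worth sketching. The example below assumes a value that itself contains a colon (a URL with a scheme, which does not occur in the question's sample) and compares splitting at every colon with splitting only at the first one via n=1:

```python
import pandas as pd

# Hypothetical entry whose value contains a colon (not in the question's sample).
s = pd.Series(["**respo_URL **: https://github.com"])

# Splitting at every colon and taking element 1 keeps only the URL scheme.
naive = s.str.split(":").str[1].str.strip()

# Splitting at the first colon only (n=1) keeps everything after the label.
limited = s.str.split(":", n=1).str[1].str.strip()

print(naive[0])    # https
print(limited[0])  # https://github.com
```

If the real data can contain such values, limiting the split to the first colon is probably the safer choice.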

Result for the sample:

   sl no   key        id    Sum     Title             Body       respo_URL  \
0    661  3484  13592349  [E-1]   New_ind  Detection_error  www.github.com   
1    662  3483  13592490  [E-1]  New_ind2     import_error                   

  respo_status  
0     {yellow}  
1      {green}

The following regex version worked for the sample, but I think it's not as robust as the other one:

# Raw f-string so that \* reaches the regex engine unchanged.
pattern = "\n\n".join(
    rf"\*+{col} \*+: (?P<{col}>[^\n]*)"
    for col in ("Title", "Body", "respo_URL", "respo_status")
)
desc_df = df["Desc"].str.extract(pattern)
df = pd.concat([df.drop(columns="Desc"), desc_df], axis=1)
