
I have a DataFrame df in which some columns hold doubled entries separated by a comma. I want to write a function that keeps only the entry before the comma in every column that has doubled entries.

Example: 20.15,20.15 becomes 20.15

Here is the DataFrame:

import pandas as pd
 
# initialize data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack','Phil','Shaq','Frank','Jerome','Arpan','Sean'],
        'Age': ['20.15,20.15', '21.02,21.02', '19.04,19.04','18.17,18.17','65.77,65.77','34.19,34.19','76.12,76.12','65.55,65.55','55.03,55.03','41.11,41.11'],
        'Score_1':['10,10', '21,21', '19,19','18,18','65,65','34,34','76,76','65,65','55,55','41,41'],
        'Score_2':['11,11', '31,31', '79,79','38,38','75,75','94,94','26,26','15,15','96,96','23,23'],
        'Score_3':['101,101', '212,212', '119,119','218,218','765,765','342,342','706,706','615,615','565,565','491,491'],
        'Type':[ 'A','C','D','F','B','E','H','G','J','K'],
        'bonus':['3.13,3.13','5.02,5.02','4.98,4.98','6.66,6.66','0.13,0.13','4.13,4.13','5.12,5.12','4.28,4.28','6.16,6.16','5.13,5.13'],
        'delta':[0.1,0.3,2.3,8.2,7.1,5.7,8.8,9.1,4.3,2.9]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

Desired output (you can copy and paste it):

# initialize data of lists.
df1 = {'Name': ['Tom', 'nick', 'krish', 'jack','Phil','Shaq','Frank','Jerome','Arpan','Sean'],
        'Age': ['20.15', '21.02', '19.04','18.17','65.77','34.19','76.12','65.55','55.03','41.11'],
        'Score_1':['10', '21', '19','18','65','34','76','65','55','41'],
        'Score_2':['11', '31', '79','38','75','94','26','15','96','23'],
        'Score_3':['101', '212', '119','218','765','342','706','615','565','491'],
        'Type':[ 'A','C','D','F','B','E','H','G','J','K'],
        'bonus':['3.13','5.02','4.98','6.66','0.13','4.13','5.12','4.28','6.16','5.13'],
        'delta':[0.1,0.3,2.3,8.2,7.1,5.7,8.8,9.1,4.3,2.9]}
 
# Create DataFrame
df2 = pd.DataFrame(df1)
 
# Print the output.
print(df2)

I need help writing a more robust function. My attempt is below:

def stringsplitter(data,column):
# select columns with object datatype
  data1 = data.select_dtypes(include=['object'])
  cols= data1[column].str.split(',', n=1).str
  print(cols[0])

# applying stringsplitter to the dataframe

final_df = df.apply(stringsplitter)

Thanks for your help

  • Try using df["Age"] = df["Age"].str.split(",").str[0] Commented Feb 10, 2024 at 1:12
  • @rr_goyal, this only works for a single column. In reality, can it be applied to 50+ columns without selecting them manually? Commented Feb 10, 2024 at 1:33

2 Answers

You can create the DataFrame and then edit the columns that have a comma. Note that this will only work if you're sure only the columns with duplicated data have commas in their values.

# Create DataFrame
df = pd.DataFrame(data)

for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].astype(str).str.split(",").str[0]

# Print the output.
print(df)

The result will be:

     Name    Age Score_1 Score_2 Score_3 Type bonus  delta
0     Tom  20.15      10      11     101    A  3.13    0.1
1    nick  21.02      21      31     212    C  5.02    0.3
2   krish  19.04      19      79     119    D  4.98    2.3
3    jack  18.17      18      38     218    F  6.66    8.2
4    Phil  65.77      65      75     765    B  0.13    7.1
5    Shaq  34.19      34      94     342    E  4.13    5.7
6   Frank  76.12      76      26     706    H  5.12    8.8
7  Jerome  65.55      65      15     615    G  4.28    9.1
8   Arpan  55.03      55      96     565    J  6.16    4.3
9    Sean  41.11      41      23     491    K  5.13    2.9
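
For reuse across many columns, the same loop can be wrapped in a function. The following is a minimal sketch, not part of the original answer: the name keep_first_entry and its sep parameter are hypothetical, and it assumes every string column either holds doubled values or contains no separator at all. It checks for both object and pandas string dtypes so that it does not depend on how a given pandas version stores text.

```python
import pandas as pd


def keep_first_entry(frame: pd.DataFrame, sep: str = ",") -> pd.DataFrame:
    """Return a copy in which text columns keep only the part before `sep`.

    Hypothetical helper: assumes any separator found in a text column
    marks a doubled entry.
    """
    out = frame.copy()
    for col in out.columns:
        dtype = out[col].dtype
        # Text may be stored as object dtype or as a pandas string dtype.
        is_text = pd.api.types.is_object_dtype(dtype) or isinstance(dtype, pd.StringDtype)
        if is_text:
            # n=1 stops after the first separator; .str[0] keeps the left part.
            out[col] = out[col].str.split(sep, n=1).str[0]
    return out


sample = pd.DataFrame({
    "Name": ["Tom", "nick"],
    "Age": ["20.15,20.15", "21.02,21.02"],
    "delta": [0.1, 0.3],
})
cleaned = keep_first_entry(sample)
print(cleaned)
```

Because the function works on a copy, the original DataFrame is left untouched, and numeric columns such as delta are skipped entirely.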

2 Comments

This works for this dataset. I was thinking along the lines of code that is reusable regardless of the columns, hence my attempt here: data1 = data.select_dtypes(include=['object'])
You can do that if you're sure only the columns with duplicated data have commas in their values. I edited the answer accordingly.

You can use replace with regex:

df = df.replace(r'([0-9.]+),.*',r'\1',regex=True)

Output:

     Name    Age Score_1 Score_2 Score_3 Type bonus  delta
0     Tom  20.15      10      11     101    A  3.13    0.1
1    nick  21.02      21      31     212    C  5.02    0.3
2   krish  19.04      19      79     119    D  4.98    2.3
3    jack  18.17      18      38     218    F  6.66    8.2
4    Phil  65.77      65      75     765    B  0.13    7.1
5    Shaq  34.19      34      94     342    E  4.13    5.7
6   Frank  76.12      76      26     706    H  5.12    8.8
7  Jerome  65.55      65      15     615    G  4.28    9.1
8   Arpan  55.03      55      96     565    J  6.16    4.3
9    Sean  41.11      41      23     491    K  5.13    2.9
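
The pattern captures the run of digits and dots before the first comma and replaces the match with that captured group, so numeric columns and strings without a number followed by a comma are left alone. As written it is unanchored, so it can also match in the middle of a longer string. The sketch below uses an anchored variant, which is an assumption of mine rather than part of the answer, and a hypothetical name containing a comma to show that such a cell is not altered.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Smith, John", "nick"],  # hypothetical value containing a comma
    "Age": ["20.15,20.15", "21.02,21.02"],
    "delta": [0.1, 0.3],
})

# Anchored variant: the whole cell must start with a number followed by a
# comma before anything is stripped. \1 keeps only the captured number.
cleaned = sample.replace(r"^([0-9.]+),.*$", r"\1", regex=True)
print(cleaned)
```

Here Age is reduced to its first entry, while Name and delta come back unchanged.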
