How to write a Python function that splits and selects the first element in each pandas column

Question

I have a dataframe df in which some columns hold double entries separated by a comma. I want to write a function that keeps only the entry before the comma in those columns.

Example: 20.15,20.15 split to 20.15

See the dataframe

import pandas as pd
 
# initialize data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack','Phil','Shaq','Frank','Jerome','Arpan','Sean'],
        'Age': ['20.15,20.15', '21.02,21.02', '19.04,19.04','18.17,18.17','65.77,65.77','34.19,34.19','76.12,76.12','65.55,65.55','55.03,55.03','41.11,41.11'],
        'Score_1':['10,10', '21,21', '19,19','18,18','65,65','34,34','76,76','65,65','55,55','41,41'],
        'Score_2':['11,11', '31,31', '79,79','38,38','75,75','94,94','26,26','15,15','96,96','23,23'],
        'Score_3':['101,101', '212,212', '119,119','218,218','765,765','342,342','706,706','615,615','565,565','491,491'],
        'Type':[ 'A','C','D','F','B','E','H','G','J','K'],
        'bonus':['3.13,3.13','5.02,5.02','4.98,4.98','6.66,6.66','0.13,0.13','4.13,4.13','5.12,5.12','4.28,4.28','6.16,6.16','5.13,5.13'],
        'delta':[0.1,0.3,2.3,8.2,7.1,5.7,8.8,9.1,4.3,2.9]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

Desired output (You can copy & paste)

# initialize data of lists.
df1 = {'Name': ['Tom', 'nick', 'krish', 'jack','Phil','Shaq','Frank','Jerome','Arpan','Sean'],
        'Age': ['20.15', '21.02', '19.04','18.17','65.77','34.19','76.12','65.55','55.03','41.11'],
        'Score_1':['10', '21', '19','18','65','34','76','65','55','41'],
        'Score_2':['11', '31', '79','38','75','94','26','15','96','23'],
        'Score_3':['101', '212', '119','218','765','342','706','615','565','491'],
        'Type':[ 'A','C','D','F','B','E','H','G','J','K'],
        'bonus':['3.13','5.02','4.98','6.66','0.13','4.13','5.12','4.28','6.16','5.13'],
        'delta':[0.1,0.3,2.3,8.2,7.1,5.7,8.8,9.1,4.3,2.9]}
 
# Create DataFrame
df2 = pd.DataFrame(df1)
 
# Print the output.
print(df2)

I need help writing a more robust function. See my attempt below:

def stringsplitter(data,column):
# select columns with object datatype
  data1 = data.select_dtypes(include=['object'])
  cols= data1[column].str.split(',', n=1).str
  print(cols[0])

# applying stringsplitter to the dataframe

final_df = df.apply(stringsplitter)

Thanks for your help

@rr_goyal, this only works for a single column. In reality, can this be applied to 50+ columns without manually selecting them?
– nasa313, Commented Feb 10, 2024 at 1:33

e-motta · Accepted Answer · 2024-02-10 01:40:02Z

3

You can create the DataFrame and then edit the columns that have a comma. Note that this will only work if you're sure only the columns with duplicated data have commas in their values.

# Create DataFrame
df = pd.DataFrame(data)

for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].astype(str).str.split(",").str[0]

# Print the output.
print(df)

The result will be:

     Name    Age Score_1 Score_2 Score_3 Type bonus  delta
0     Tom  20.15      10      11     101    A  3.13    0.1
1    nick  21.02      21      31     212    C  5.02    0.3
2   krish  19.04      19      79     119    D  4.98    2.3
3    jack  18.17      18      38     218    F  6.66    8.2
4    Phil  65.77      65      75     765    B  0.13    7.1
5    Shaq  34.19      34      94     342    E  4.13    5.7
6   Frank  76.12      76      26     706    H  5.12    8.8
7  Jerome  65.55      65      15     615    G  4.28    9.1
8   Arpan  55.03      55      96     565    J  6.16    4.3
9    Sean  41.11      41      23     491    K  5.13    2.9
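The loop above could be wrapped in a reusable function along the lines the question asks for. The following is only a minimal sketch: the name `keep_first_entry` and its `sep` parameter are hypothetical, and it assumes that any comma in a text column marks a duplicated entry. Note that the attempt in the question fails partly because `df.apply` passes each column to the function as a single Series, so the second argument (`column`) is never supplied.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def keep_first_entry(frame, sep=","):
    """Return a copy of `frame` whose text columns keep only the part before `sep`.

    Hypothetical helper, not part of pandas.
    """
    out = frame.copy()  # leave the caller's DataFrame untouched

    def first_part(value):
        # Only real strings are split; NaN, numbers, etc. pass through unchanged.
        return value.split(sep, 1)[0] if isinstance(value, str) else value

    for col in out.columns:
        # Detect text columns automatically, so no column list is needed.
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            out[col] = out[col].map(first_part)
    return out


sample = pd.DataFrame({
    "Name": ["Tom", "nick"],
    "Age": ["20.15,20.15", "21.02,21.02"],
    "delta": [0.1, 0.3],
})
result = keep_first_entry(sample)
print(result)
```

Unlike `.astype(str)`, mapping a plain Python function leaves missing values as they are instead of turning them into the string "nan", which may matter on larger datasets.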

edited Feb 10, 2024 at 1:40

answered Feb 10, 2024 at 1:16

e-motta


2 Comments

nasa313 Over a year ago

This works for this dataset, but I was thinking along the lines of code that is reusable regardless of the columns, hence my attempt here: data1 = data.select_dtypes(include=['object'])

e-motta Over a year ago

You can do that if you're sure only the columns with duplicated data have commas in their values. I edited the answer accordingly.

rhug123 · Accepted Answer · 2024-02-10 03:49:05Z

You can use replace with regex:

df = df.replace(r'([0-9.]+),.*',r'\1',regex=True)

Output:

     Name    Age Score_1 Score_2 Score_3 Type bonus  delta
0     Tom  20.15      10      11     101    A  3.13    0.1
1    nick  21.02      21      31     212    C  5.02    0.3
2   krish  19.04      19      79     119    D  4.98    2.3
3    jack  18.17      18      38     218    F  6.66    8.2
4    Phil  65.77      65      75     765    B  0.13    7.1
5    Shaq  34.19      34      94     342    E  4.13    5.7
6   Frank  76.12      76      26     706    H  5.12    8.8
7  Jerome  65.55      65      15     615    G  4.28    9.1
8   Arpan  55.03      55      96     565    J  6.16    4.3
9    Sean  41.11      41      23     491    K  5.13    2.9
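To see what the pattern does and does not touch, here is a small sketch on made-up data (the `Note` column and the second, broader pattern are assumptions added for illustration, not part of the answer). The pattern above only rewrites values where digits or dots come directly before the comma, so ordinary text containing a comma is left alone, and numeric columns are skipped entirely.

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": ["20.15,20.15", "21.02,21.02"],
    "Note": ["Smith, John", "n/a"],  # hypothetical free-text column
    "delta": [0.1, 0.3],
})

# Pattern from the answer: keep the numeric part before the first comma.
numeric_only = sample.replace(r"([0-9.]+),.*", r"\1", regex=True)

# Assumed variant: keep everything before the first comma, whatever it is.
any_text = sample.replace(r"^([^,]*),.*$", r"\1", regex=True)

print(numeric_only)
print(any_text)
```

Which pattern fits depends on whether text columns with legitimate commas can occur in the data; the narrower one is the safer default in that case.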
