Use regex to remove/exclude columns from dataframe - Python

Question

I have a dataframe which can be generated from the code below

    df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],'date3derived':[0,0,0],'val3':[7,9,11]})

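The dataframe has a person_id column followed by three groups of date, derived and value columns (date1, date1derived, val1, and so on).

I would like to remove the columns that contain "derived" in their name. I tried several regex patterns with df.filter but couldn't get the expected output.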

    df = df.filter(regex='[^H\dDerived]+', axis=1)
    df = df.filter(regex='[^Derived]',axis=1)

Can you let me know the right regex to do this?

heemayl · Accepted Answer · 2019-06-28 12:03:32Z

2

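You can use a zero-width negative lookahead to make sure the string derived does not appear anywhere in the column name:

^(?!.*?derived)

^ matches the start of the string.
(?!.*?derived) is the negative lookahead; it fails the match if derived appears anywhere after that position.

Your pattern [^Derived] is a negated character class: it matches any single character that is not one of D, e, r, i, v, d. Since df.filter keeps every column whose name contains a match, a name passes as soon as it has one character outside that set.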
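As a minimal sketch of how the lookahead plugs into df.filter (the shortened frame and the names kept and unchanged are only for illustration, not part of the original answer):

```python
import pandas as pd

# Shortened version of the question's frame (illustrative subset of columns)
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
    'date1derived': [0, 0, 0],
    'val1': [2, 4, 6],
})

# filter(regex=...) keeps a column when the pattern is found in its name.
# The lookahead rejects any name that has "derived" somewhere in it.
kept = df.filter(regex=r'^(?!.*derived)', axis=1)
print(list(kept.columns))  # ['person_id', 'date1', 'val1']

# The negated character class from the question keeps everything here,
# because each name has at least one character outside D/e/r/i/v/d.
unchanged = df.filter(regex='[^Derived]', axis=1)
print(list(unchanged.columns))
```

Note that the match is case-sensitive, so a column named Derived_x would survive this pattern.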

answered Jun 28, 2019 at 12:03

heemayl

Andy L. · Accepted Answer · 2019-06-28 12:04:13Z

2

IIUC, you want to drop the columns that have derived in their name. This should do it:

df.drop(df.filter(like='derived').columns, axis=1)

Out[455]:
   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

answered Jun 28, 2019 at 12:04

Andy L.

2 Comments

The Great Over a year ago

Hi, thanks for the response. It will drop only the columns that contain "derived", and it will not drop columns that merely contain "der". Am I right?

Andy L. Over a year ago

Yes. The name must contain the full string derived to be dropped.
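To illustrate the point from this exchange, here is a small sketch with made-up column names (order_id and derived_flag are hypothetical, chosen only to show the behaviour). like='derived' is a plain substring test on the label, so "der" alone is not enough:

```python
import pandas as pd

# Hypothetical columns: 'order_id' contains "der" but not "derived"
df = pd.DataFrame({
    'order_id': [10, 11],
    'date1derived': [0, 0],
    'derived_flag': [1, 0],
})

# like='derived' selects labels that contain the substring "derived"
to_drop = df.filter(like='derived').columns
print(list(to_drop))  # ['date1derived', 'derived_flag']

# Passing the labels by keyword avoids relying on a positional axis argument
result = df.drop(columns=to_drop)
print(list(result.columns))  # ['order_id']
```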

anky · Accepted Answer · 2019-06-28 12:06:31Z

1

Use pd.Index.difference() with df.filter():

df[df.columns.difference(df.filter(like='derived').columns,sort=False)]

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

answered Jun 28, 2019 at 12:06

anky

4 Comments

anky Over a year ago

@AVLES df.filter(like='derived').columns gives you columns having derived , using pd.Index.difference(), we find the difference between df.columns and the filtered columns. sort=False to keep the original order. :)

The Great Over a year ago

Okay, usually when we do drop, the order of columns can change?

anky Over a year ago

@AVLES no, it wouldn't; you can opt for drop() too. However, any approach that returns a copy of the list would be faster, since you will slice the df anyway (IMO).

anky Over a year ago

@AVLES i think you are looking for .dropna(axis=1) in this case
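On the ordering question raised earlier in this thread, a quick sketch (using a shortened, illustrative set of column names) of how sort=False differs from the default in Index.difference, and of drop() leaving the order alone:

```python
import pandas as pd

# Illustrative subset of the question's column names
cols = pd.Index(['person_id', 'date1', 'date1derived', 'val1'])
derived = pd.Index(['date1derived'])

# Default behaviour: the result is sorted when the labels are sortable
sorted_diff = cols.difference(derived)
print(list(sorted_diff))  # ['date1', 'person_id', 'val1']

# sort=False keeps the labels in their original order
ordered_diff = cols.difference(derived, sort=False)
print(list(ordered_diff))  # ['person_id', 'date1', 'val1']

# drop() does not reorder the remaining columns either
df = pd.DataFrame([[1, 'a', 0, 2]], columns=cols)
dropped = df.drop(columns=derived)
print(list(dropped.columns))  # ['person_id', 'date1', 'val1']
```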

iamklaus · Accepted Answer · 2019-06-28 12:01:25Z

1

df[[c for c in df.columns if 'derived' not in c ]]

Output

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

answered Jun 28, 2019 at 12:01

iamklaus

4 Comments

The Great Over a year ago

A quick question. Let's say in my dataframe "date2" column was empty, how do I remove the column? I mean when values are NA, we usually drop rows(record), but if date2 column is all NA, I would like to drop "date2" column. Can you let me know how to do this?

iamklaus Over a year ago

pandas.pydata.org/pandas-docs/stable/reference/api/… will help, use how='all' with column name

The Great Over a year ago

I mean I would like to drop columns only if all values in them are "NA". It should not drop columns that have just 1 or 2 NAs.

iamklaus Over a year ago

Yes, how="all" is there for that purpose. Go through the docs and you will get it.
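A minimal sketch of what how="all" does when applied along the columns, assuming the missing values are real NaN values rather than the string "NA" (the sample data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: date2 is entirely missing, val2 has a single missing value
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date2': [np.nan, np.nan, np.nan],
    'val2': [1, np.nan, 5],
})

# axis=1 looks at columns; how='all' drops a column only if every value is NA
cleaned = df.dropna(axis=1, how='all')
print(list(cleaned.columns))  # ['person_id', 'val2']
```

val2 survives because only one of its values is missing.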

bharatk · Accepted Answer · 2019-06-28 12:11:36Z

In recent versions of pandas, you can use string methods on the index and columns. Here, str.endswith seems like a good fit.

import pandas as pd

df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],
                   'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],
                   'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],
                   'date3derived':[0,0,0],'val3':[7,9,11]})

df = df.loc[:,~df.columns.str.endswith('derived')]

print(df)

Output:

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11
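If the marker is not always at the end of the name, or its case varies (the patterns tried in the question use a capital D), str.contains may be a closer fit than str.endswith. A sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical names: the marker appears at the end, at the start, and capitalised
df = pd.DataFrame({
    'date1': ['12/31/2007'],
    'date1derived': [0],
    'Derived_total': [0],
    'derived_flag': [1],
})

# endswith only catches names that end in the exact lowercase suffix
by_suffix = df.loc[:, ~df.columns.str.endswith('derived')]
print(list(by_suffix.columns))  # ['date1', 'Derived_total', 'derived_flag']

# contains with case=False catches the substring anywhere, in any case
by_substring = df.loc[:, ~df.columns.str.contains('derived', case=False)]
print(list(by_substring.columns))  # ['date1']
```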
