1

I have a dataframe that can be generated with the code below:

    import pandas as pd

    df = pd.DataFrame({'person_id': [1,2,3], 'date1': ['12/31/2007','11/25/2009','10/06/2005'],
                       'date1derived': [0,0,0], 'val1': [2,4,6], 'date2': ['12/31/2017','11/25/2019','10/06/2015'],
                       'date2derived': [0,0,0], 'val2': [1,3,5], 'date3': ['12/31/2027','11/25/2029','10/06/2025'],
                       'date3derived': [0,0,0], 'val3': [7,9,11]})

I would like to remove columns that contain "derived" in their name. I tried different regex but couldn't get the expected output.

    df = df.filter(regex='[^H\dDerived]+', axis=1)
    df = df.filter(regex='[^Derived]',axis=1)

Can you let me know the right regex to do this?

5 Answers

2

You can use a zero-width negative lookahead to make sure the string derived does not appear anywhere in the column name:

^(?!.*?derived)
  • ^ matches the start of the string
  • (?!.*?derived) is the negative lookahead pattern that makes sure derived does not come in the string

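Your pattern [^Derived] matches any single character that is not one of D, e, r, i, v, d. Because df.filter keeps a column whenever the regex matches anywhere in its name, every column name that contains at least one other character is kept, so nothing gets removed.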
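
As a minimal sketch of the lookahead in use (assuming a shortened version of the question's dataframe, and relying on df.filter applying the pattern with re.search), the two patterns can be compared side by side:

```python
import pandas as pd

# Shortened version of the dataframe from the question (assumption: fewer columns).
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
    'date1derived': [0, 0, 0],
    'val1': [2, 4, 6],
})

# Negative lookahead: a name is kept only if "derived" appears nowhere in it.
kept = df.filter(regex=r'^(?!.*derived)', axis=1)

# Character class from the question: every name has some character outside
# the set {D, e, r, i, v, d}, so every column is kept.
unchanged = df.filter(regex='[^Derived]', axis=1)

print(list(kept.columns))
print(list(unchanged.columns))
```

The first call leaves person_id, date1 and val1; the second returns all four columns.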

2

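IIUC, you want to drop the columns that have derived in their name. This should do it (axis is passed by keyword because recent pandas versions no longer accept it positionally):

df.drop(df.filter(like='derived').columns, axis=1)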

Out[455]:
   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

2 Comments

Hi, Thanks for the response. It will drop all columns that only contain "derived" and it will not drop columns that contain "der". Am I right?
Yes. A column name must contain the full string derived to be dropped.
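
To illustrate the point made in these comments, here is a small sketch with hypothetical column names (order, Derived2 and derived_flag are not in the original dataframe). It assumes that like= does a plain, case-sensitive substring check on each label:

```python
import pandas as pd

# Hypothetical columns: 'order' contains "der" but not "derived";
# 'Derived2' differs from "derived" only by case.
df = pd.DataFrame({
    'order': [1],
    'date1derived': [0],
    'Derived2': [0],
    'derived_flag': [0],
    'val1': [2],
})

# like='derived' selects labels that contain the exact substring "derived".
to_drop = df.filter(like='derived').columns
result = df.drop(columns=to_drop)

print(list(to_drop))
print(list(result.columns))
```

Only date1derived and derived_flag are dropped; order and Derived2 stay.
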
1

Use pd.Index.difference() with df.filter():

df[df.columns.difference(df.filter(like='derived').columns,sort=False)]

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

4 Comments

@AVLES df.filter(like='derived').columns gives you the columns that contain derived; with pd.Index.difference() we take the difference between df.columns and those filtered columns. sort=False keeps the original column order. :)
Okay, usually when we do drop, the order of columns can change?
@AVLES No, it wouldn't; you can opt for drop() too. However, IMO, any approach that builds the list of columns to keep should be faster, since you end up slicing the df anyway.
@AVLES I think you are looking for .dropna(axis=1) in this case.
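
The column-order question in these comments can be checked with a short sketch (assuming a shortened version of the question's dataframe). It compares Index.difference with its default sorting, Index.difference with sort=False, and drop:

```python
import pandas as pd

# Shortened dataframe (assumption) whose column order is not alphabetical.
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', '11/25/2009', '10/06/2005'],
    'date1derived': [0, 0, 0],
    'val1': [2, 4, 6],
})

derived_cols = df.filter(like='derived').columns

# Default behaviour: the resulting labels are sorted.
sorted_cols = list(df.columns.difference(derived_cols))

# sort=False: the labels keep the order they have in df.columns.
ordered_cols = list(df.columns.difference(derived_cols, sort=False))

# drop() also leaves the remaining columns in their original order.
dropped_cols = list(df.drop(columns=derived_cols).columns)

print(sorted_cols)
print(ordered_cols)
print(dropped_cols)
```

Only the default call reorders the columns (date1 comes before person_id); the other two keep person_id, date1, val1.
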
1
df[[c for c in df.columns if 'derived' not in c ]]

Output

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11

4 Comments

A quick question. Let's say in my dataframe "date2" column was empty, how do I remove the column? I mean when values are NA, we usually drop rows(record), but if date2 column is all NA, I would like to drop "date2" column. Can you let me know how to do this?
pandas.pydata.org/pandas-docs/stable/reference/api/… will help; use how='all' with the column name.
I mean I would like to drop columns only if all values in them are "NA". It should not drop columns that have just 1 or 2 NAs.
Yes, how="all" is there for that purpose. Go through the doc and you will get it.
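
A minimal sketch of what these comments describe, assuming the missing entries are real missing values (NaN or None) and not the literal string "NA", and using a modified version of the question's data:

```python
import numpy as np
import pandas as pd

# Modified data (assumption): 'date2' is entirely missing,
# while 'date1' has a single missing value.
df = pd.DataFrame({
    'person_id': [1, 2, 3],
    'date1': ['12/31/2007', None, '10/06/2005'],
    'date2': [np.nan, np.nan, np.nan],
    'val1': [2, 4, 6],
})

# axis=1 looks at columns; how='all' drops a column only if every value is missing.
cleaned = df.dropna(axis=1, how='all')

print(list(cleaned.columns))
```

Here date2 is removed and date1 is kept despite its single missing value. If the data holds the text "NA", it would need to be converted to a real missing value first, for example with df.replace('NA', np.nan).
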
1

In recent versions of pandas, you can use string methods on the index and columns. Here, str.endswith seems like a good fit.

import pandas as pd

df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],
                   'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],
                   'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],
                   'date3derived':[0,0,0],'val3':[7,9,11]})

df = df.loc[:,~df.columns.str.endswith('derived')]

print(df)

Output:

   person_id       date1  val1       date2  val2       date3  val3
0          1  12/31/2007     2  12/31/2017     1  12/31/2027     7
1          2  11/25/2009     4  11/25/2019     3  11/25/2029     9
2          3  10/06/2005     6  10/06/2015     5  10/06/2025    11
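
Note that str.endswith only removes names that end in derived, while the question asks about names that contain it. A hedged sketch of the difference, using a hypothetical derived_date2 column that is not in the original data:

```python
import pandas as pd

# 'derived_date2' is a hypothetical column where "derived" is not a suffix.
df = pd.DataFrame({
    'person_id': [1],
    'date1derived': [0],
    'derived_date2': [0],
    'val1': [2],
})

# Suffix check: only names ending in "derived" are removed.
by_suffix = df.loc[:, ~df.columns.str.endswith('derived')]

# Substring check: names containing "derived" anywhere are removed.
# regex=False treats the pattern as a literal string.
by_substring = df.loc[:, ~df.columns.str.contains('derived', regex=False)]

print(list(by_suffix.columns))
print(list(by_substring.columns))
```

For the dataframe in the question both give the same result, since every derived column there ends with that word.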
