I have a Spark dataframe with 3k-4k columns and I'd like to drop columns whose names meet certain variable criteria, e.g. where the column name contains 'foo'.
1 Answer
You can get the column names with df.columns, and drop() supports dropping many columns in one call. The code below combines the two and does what you need:
# Predicate: True for any column name containing the substring 'foo'
condition = lambda col: 'foo' in col

# Unpack the matching names into drop(), which accepts multiple columns
new_df = df.drop(*filter(condition, df.columns))
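The column-selection part can be demonstrated without a Spark session, since it operates on plain Python strings. A minimal sketch (the column names here are hypothetical stand-ins for a wide dataframe schema):

```python
# Hypothetical column names standing in for df.columns on a wide Spark dataframe.
columns = ["id", "foo_a", "bar", "a_foo_b", "baz"]

# Same predicate as above: keep names containing the substring 'foo'.
condition = lambda col: 'foo' in col

# These are the columns that df.drop(*filter(condition, df.columns)) would remove.
to_drop = list(filter(condition, columns))
print(to_drop)  # → ['foo_a', 'a_foo_b']
```

The * unpacks the filtered names into separate arguments, since drop() takes columns as varargs rather than a list.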
3 Comments
DespicableMe
This absolutely solved my issue, but I don't understand the syntax. I interpreted filter as matching any column containing '*foo', but that's not the case; 'foo' seems to be treated as a substring, i.e. *foo*. Can you point to documentation that details this method? Thanks for the awesome help.
Mariusz
filter is a builtin Python function that filters any iterable collection. You can find the documentation here: docs.python.org/3/library/functions.html#filter
SamuelNLP
You should not assign the lambda; just use:
new_df = df.drop(*filter(lambda col: 'foo' in col, df.columns))