I have a Spark dataframe with 3k-4k columns and I'd like to drop columns whose names meet certain variable criteria, e.g. where the column name contains 'foo'.
1 Answer
You can get the column names with df.columns, and drop() supports dropping many columns in one call. The code below combines the two and does what you need:
# Predicate: True for any column name containing the substring 'foo'
condition = lambda col: 'foo' in col

# Unpack the matching names into drop(), which accepts multiple columns
new_df = df.drop(*filter(condition, df.columns))
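The column-selection part can be demonstrated without a Spark session, since it operates on plain Python strings. A minimal sketch (the column names here are hypothetical stand-ins for a wide dataframe schema):

```python
# Hypothetical column names standing in for df.columns on a wide Spark dataframe.
columns = ["id", "foo_a", "bar", "a_foo_b", "baz"]

# Same predicate as above: keep names containing the substring 'foo'.
condition = lambda col: 'foo' in col

# These are the columns that df.drop(*filter(condition, df.columns)) would remove.
to_drop = list(filter(condition, columns))
print(to_drop)  # → ['foo_a', 'a_foo_b']
```

The * unpacks the filtered names into separate arguments, since drop() takes columns as varargs rather than a list.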
3 Comments
DespicableMe
This absolutely solved my issue, but I don't understand the syntax. I interpreted filter as matching any column containing '*foo', but that's not the case; 'foo' seems to be treated as a substring, i.e. *foo*. Can you point to documentation that details this method? Thanks for the awesome help.
Mariusz
filter is a builtin Python function that filters any iterable collection. You can find the documentation here: docs.python.org/3/library/functions.html#filter
SamuelNLP
You should not assign the lambda; just use:
new_df = df.drop(*filter(lambda col: 'foo' in col, df.columns))