I would like to implement the SQL conditions below in PySpark:

SELECT *
            FROM   table
            WHERE  NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 

What would be the clean way to do this?

  • If you want to keep it simple, I think you could define a UDF to run in your query. Commented Jan 7, 2021 at 4:58
  • A UDF is not necessary; it would just make the performance worse. Commented Jan 7, 2021 at 5:01
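
For reference, the UDF route the first comment suggests would look roughly like the sketch below (keep_row is a hypothetical name; the ID and Event columns come from the question). As the second comment notes, a Python UDF is opaque to Spark's Catalyst optimizer and adds per-row serialization overhead, which is why the native column expressions in the answers below are preferred:

# Hypothetical UDF version of the same filter, shown only for comparison.
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def keep_row(id_, event):
    # Keep a row unless its (ID, Event) pair is one of the excluded ones.
    return (id_, event) not in {(1, 1), (2, 2), (1, 0), (2, 0)}

# df.filter(keep_row(df.ID, df.Event))  # works, but slower than native predicates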

2 Answers

You can use the filter or where function of the DataFrame API.

The equivalent code would be as follows:

# ~ negates a Column predicate and & combines predicates; each comparison
# needs its own parentheses because & binds more tightly than ==.
df.filter(~((df.ID == 1) & (df.Event == 1)) &
          ~((df.ID == 2) & (df.Event == 2)) &
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))
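
For completeness, a minimal end-to-end sketch of this answer, assuming a local SparkSession and hypothetical sample data with the ID and Event columns from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1), (1, 0), (2, 2), (2, 0), (1, 2), (3, 1)],
    ["ID", "Event"],
)

# Each negated conjunction removes one (ID, Event) pair.
result = df.filter(~((df.ID == 1) & (df.Event == 1)) &
                   ~((df.ID == 2) & (df.Event == 2)) &
                   ~((df.ID == 1) & (df.Event == 0)) &
                   ~((df.ID == 2) & (df.Event == 0)))
result.show()  # only the (1, 2) and (3, 1) rows survive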

If you're lazy, you can just copy and paste the SQL filter expression into the PySpark filter:

df.filter("""
               NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 
""")
