I would like to implement the SQL conditions below in PySpark:

SELECT *
            FROM   table
            WHERE  NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 

What would be the clean way to do this?

  • If you want to keep it simple, I think you could define a UDF to run in your query. Commented Jan 7, 2021 at 4:58
  • A UDF is not necessary; it would just make the performance worse. Commented Jan 7, 2021 at 5:01
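
For reference, the UDF route the first comment suggests would look roughly like the sketch below (keep_row is a hypothetical name; the ID and Event columns come from the question). As the second comment notes, a Python UDF is opaque to Spark's Catalyst optimizer and adds per-row serialization overhead, which is why the native column expressions in the answers below are preferred:

# Hypothetical UDF version of the same filter, shown only for comparison.
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def keep_row(id_, event):
    # Keep a row unless its (ID, Event) pair is one of the excluded ones.
    return (id_, event) not in {(1, 1), (2, 2), (1, 0), (2, 0)}

# df.filter(keep_row(df.ID, df.Event))  # works, but slower than native predicates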

2 Answers

You can use the filter or where function of the DataFrame API.

The equivalent code would be as follows:

# ~ negates a Column predicate and & combines predicates; each comparison
# needs its own parentheses because & binds more tightly than ==.
df.filter(~((df.ID == 1) & (df.Event == 1)) &
          ~((df.ID == 2) & (df.Event == 2)) &
          ~((df.ID == 1) & (df.Event == 0)) &
          ~((df.ID == 2) & (df.Event == 0)))
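
For completeness, a minimal end-to-end sketch of this answer, assuming a local SparkSession and hypothetical sample data with the ID and Event columns from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1), (1, 0), (2, 2), (2, 0), (1, 2), (3, 1)],
    ["ID", "Event"],
)

# Each negated conjunction removes one (ID, Event) pair.
result = df.filter(~((df.ID == 1) & (df.Event == 1)) &
                   ~((df.ID == 2) & (df.Event == 2)) &
                   ~((df.ID == 1) & (df.Event == 0)) &
                   ~((df.ID == 2) & (df.Event == 0)))
result.show()  # only the (1, 2) and (3, 1) rows survive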

If you're lazy, you can just copy and paste the SQL filter expression into the PySpark filter:

df.filter("""
               NOT ( ID = 1
                         AND Event = 1 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 2 
                       ) 
               AND NOT ( ID = 1 
                         AND Event = 0 
                       ) 
               AND NOT ( ID = 2
                         AND Event = 0 
                       ) 
""")
