
I would like to do the following in pyspark (for AWS Glue jobs):

JOIN a AND b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city

So for example:

Table a:

Number Name City
1000 Bob %
2000 Joe London

Table b:

Number Name City
1000 Bob Boston
1000 Bob Berlin
2000 Joe Paris

Results

Number Name City
1000 Bob Boston
1000 Bob Berlin

The part I don't know how to do is implementing the wildcard "%" with the LIKE operator. I know you can use .like() with a string literal, for example:

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case I would like to pass a column. Something like the following:

result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work since b.city is a column, not a string
)

Any help with this would be much appreciated!

  • Does this answer your question? Pyspark DataFrame - using like function based on column name instead of String value Commented Mar 11, 2021 at 12:11
  • @blackbishop Thanks for the suggestion, now I know it does, but when I was searching I didn't know that expr() could be used as a condition in the join, so the answer below may be useful to someone like me :) Commented Mar 11, 2021 at 12:26
  • IMO it's not a duplicate because it involves a join here, and there is also a mistake in the joining expression. Commented Mar 11, 2021 at 13:05

2 Answers


Try using an expression:

import pyspark.sql.functions as F

result = a.alias('a').join(
    b.alias('b'),
    (a.name == b.name) &
    (a.number == b.number) &
    F.expr("b.city like a.city")
)

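I think you meant b.city LIKE a.city rather than a.city LIKE b.city, because the % is in table a: in x LIKE y, the right-hand operand is the pattern, so the column holding the wildcard has to go on the right.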


2 Comments

Using alias in production is not advised: DataFrame.alias() has been marked experimental since 1.5.0.
It's been around since 1.3.0. Link here

In addition to using the like expression there is another similar way:

.join( b.alias("b"), col("a.city").contains(col("b.city")) )

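because contains also accepts a column and works as a join condition. Note that contains is a literal substring test and does not treat % as a wildcard, so it is not a drop-in replacement for LIKE.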
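To make the difference between the two conditions concrete, here is a rough pure-Python sketch that emulates both on the question's sample rows. The sql_like helper is hypothetical and simplified: it handles only the % and _ wildcards and ignores escape characters. Unlike the one-line join above, the sketch also matches on number and name so that the two conditions can be compared on equal terms.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Simplified LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# Rows are (number, name, city), copied from the question.
a_rows = [(1000, "Bob", "%"), (2000, "Joe", "London")]
b_rows = [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")]

# Emulates: b.city LIKE a.city (a.city is the pattern).
like_matches = [
    rb for ra in a_rows for rb in b_rows
    if ra[:2] == rb[:2] and sql_like(rb[2], ra[2])
]

# Emulates: col("a.city").contains(col("b.city")), a plain substring test.
contains_matches = [
    rb for ra in a_rows for rb in b_rows
    if ra[:2] == rb[:2] and rb[2] in ra[2]
]

print(like_matches)
print(contains_matches)
```

On this data the LIKE version keeps the two Bob rows, while the substring version keeps nothing, because neither "Boston" nor "Berlin" occurs inside the literal string "%". contains is a better fit when one column holds plain text fragments rather than patterns.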
