I would like to do the following in PySpark (for AWS Glue jobs):

```sql
JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city
```
So for example:
Table a:
| Number | Name | City |
|---|---|---|
| 1000 | Bob | % |
| 2000 | Joe | London |
Table b:
| Number | Name | City |
|---|---|---|
| 1000 | Bob | Boston |
| 1000 | Bob | Berlin |
| 2000 | Joe | Paris |
Results:
| Number | Name | City |
|---|---|---|
| 1000 | Bob | Boston |
| 1000 | Bob | Berlin |
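For reference, here is a minimal snippet that builds the two example tables (column names lowercased to match the join code further down):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table a: "city" holds a SQL LIKE pattern ("%" matches anything)
a = spark.createDataFrame(
    [(1000, "Bob", "%"), (2000, "Joe", "London")],
    ["number", "name", "city"],
)

# Table b: concrete city values to match against the patterns in a.city
b = spark.createDataFrame(
    [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")],
    ["number", "name", "city"],
)
```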
The part I don't know how to do is implementing the "%" wildcard with the LIKE operator. I know you can call .like() on a column with a string pattern, for example:
```python
from pyspark.sql.functions import col

df.where(col('col1').like("%string%")).show()
```
But it expects a string, whereas in my case the pattern comes from another column. I want something like the following:
```python
result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work, since .like() expects a string, not a Column
)
```
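One direction I can think of is to fall back to a SQL expression for the LIKE part, sketched below, but I'm not sure it's the right approach (untested; the `alias()` calls are just my guess at disambiguating the duplicate column names):

```python
from pyspark.sql import functions as F

# Untested sketch: alias both DataFrames so the SQL expression can
# refer to the two "city" columns unambiguously.
result = a.alias("a").join(
    b.alias("b"),
    (F.col("a.name") == F.col("b.name"))
    & (F.col("a.number") == F.col("b.number"))
    & F.expr("a.city LIKE b.city"),
)
```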
Any help with this would be very much appreciated!