How to use LIKE operator as a JOIN condition in pyspark as a column

Question

I would like to do the following in pyspark (for AWS Glue jobs):

JOIN a and b ON a.name = b.name AND a.number= b.number AND a.city LIKE b.city

So for example:

Table a:

Number	Name	City
1000	Bob	%
2000	Joe	London

Table b:

Number	Name	City
1000	Bob	Boston
1000	Bob	Berlin
2000	Joe	Paris

Results

Number	Name	City
1000	Bob	Boston
1000	Bob	Berlin

So the part I don't know how to do is to implement the wildcard "%" and use the LIKE operator. I know you can use .like() on strings, for example:

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case I would like to pass a column as the pattern. Something like the following:

result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work since it is not a string
)

Any help to do this will be very appreciated!

Does this answer your question? Pyspark DataFrame - using like function based on column name instead of String value
– blackbishop, Commented Mar 11, 2021 at 12:11
@blackbishop Thanks for the suggestion, now I know it does, but when I was searching I didn't know that expr() could be used as a condition in the join, so the answer below may be useful to someone like me :)
– Coockson, Commented Mar 11, 2021 at 12:26
IMO it's not a duplicate because it involves a join here, and there is also a mistake in the joining expression.
– mck, Commented Mar 11, 2021 at 13:05

mck · Accepted Answer · 2021-03-11 11:56:11Z

7

Try using an expression:

import pyspark.sql.functions as F

result = a.alias('a').join(
    b.alias('b'),
    (a.name == b.name) &
    (a.number == b.number) &
    F.expr("b.city like a.city")
)

I think you meant b.city LIKE a.city rather than a.city LIKE b.city: LIKE takes its pattern on the right-hand side, and the % wildcard is in table a.
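To see why the operand order matters, here is a plain-Python sketch (no Spark required) that emulates LIKE on the sample rows from the question. The helper sql_like is hypothetical and only handles the % and _ wildcards, with no escape-character support, so treat it as an illustration of the semantics rather than Spark's actual implementation.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Rough emulation of SQL LIKE: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

# Sample rows from the question: (number, name, city)
a_rows = [(1000, "Bob", "%"), (2000, "Joe", "London")]
b_rows = [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")]

# b.city LIKE a.city: the pattern (right operand) comes from table a
b_like_a = [rb for ra in a_rows for rb in b_rows
            if ra[0] == rb[0] and ra[1] == rb[1] and sql_like(rb[2], ra[2])]

# a.city LIKE b.city: the pattern comes from table b, which has no wildcards
a_like_b = [rb for ra in a_rows for rb in b_rows
            if ra[0] == rb[0] and ra[1] == rb[1] and sql_like(ra[2], rb[2])]

print(b_like_a)  # [(1000, 'Bob', 'Boston'), (1000, 'Bob', 'Berlin')]
print(a_like_b)  # []
```

Only the b LIKE a direction reproduces the Results table from the question, because "%" is treated as a wildcard only when it appears in the pattern position.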


2 Comments

lonewolf Over a year ago

Using alias is not advised in production: DataFrame.alias() has been experimental since 1.5.0.

fuzzy-memory Jul 17 at 8:39

It's been around since 1.3.0.

Jol Blazey · Accepted Answer · 2025-11-16 01:41:41Z

0

In addition to using the like expression there is another similar way:

from pyspark.sql.functions import col
result = a.alias("a").join(b.alias("b"), col("a.city").contains(col("b.city")))

because the contains function also works as a join condition.
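One caveat worth keeping in mind: contains is a literal substring test rather than a wildcard match, so it is not interchangeable with LIKE. A plain-Python sketch of that semantics on the city values from the question follows; the in operator stands in for Column.contains here, and "New York City" is a made-up value used only for illustration.

```python
# City values from the question's tables a and b
a_cities = ["%", "London"]
b_cities = ["Boston", "Berlin", "Paris"]

# Substring test: "%" is an ordinary character here, not a wildcard
matches = [(ac, bc) for ac in a_cities for bc in b_cities if bc in ac]
print(matches)  # []

# A substring test does succeed when one value literally contains the other
substring_hit = "York" in "New York City"  # hypothetical example value
print(substring_hit)  # True
```

On the question's sample data this approach returns no rows, so it suits cases where one column holds a literal fragment of the other rather than a % pattern.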
