
I have a pyspark.sql DataFrame that looks like this:

id  name  refs
1   A     [B, C, D]
2   B     [A]
3   C     [A, B]
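
For reproducibility, the sample DataFrame can be constructed like this (a minimal sketch; refs is assumed to be an array&lt;string&gt; column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; refs is an array<string> column
df = spark.createDataFrame(
    [(1, 'A', ['B', 'C', 'D']),
     (2, 'B', ['A']),
     (3, 'C', ['A', 'B'])],
    ['id', 'name', 'refs'],
)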

I'm trying to build a function that replaces each value in the refs array with the id of the name it references. If a ref has no matching name in the name column, it should ideally be filtered out or set to null. The result would look something like this:

id  name  refs
1   A     [2, 3]
2   B     [1]
3   C     [1, 2]

I tried doing this by defining a UDF that collects all names from the table and then obtains the indices of the intersection between each refs array and the set of all names. It works, but it is extremely slow; I'm sure there are better ways to do this with Spark and/or SQL.
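
For reference, the slow UDF approach probably looked something like this (a hypothetical sketch; name_to_id and resolve_refs are made-up names, and the cost comes from collecting every name to the driver and doing per-row Python lookups):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Hypothetical reconstruction of the slow approach: collect the full
# name -> id mapping to the driver, then look each ref up in a UDF
name_to_id = {row['name']: row['id'] for row in df.select('id', 'name').collect()}

@F.udf(ArrayType(IntegerType()))
def resolve_refs(refs):
    # Keep only refs that match a known name; drop the rest
    return [name_to_id[r] for r in refs if r in name_to_id]

slow_result = df.withColumn('refs', resolve_refs('refs'))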

1 Answer

You can explode the arrays, do a self-join using the exploded ref and name, and collect the joined ids back to an array using collect_list.

import pyspark.sql.functions as F

# One row per referenced name
df1 = df.select('id', 'name', F.explode('refs').alias('refs'))

# Rename the columns to avoid ambiguity in the self-join
df2 = df.toDF('id2', 'name2', 'refs2')

# Join each exploded ref to the name it references; the inner join
# drops refs with no matching name (e.g. D). Then collect the matched
# ids back into an array per (id, name).
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))

result.show()
+---+----+------+
| id|name|  refs|
+---+----+------+
|  1|   A|[2, 3]|
|  2|   B|   [1]|
|  3|   C|[1, 2]|
+---+----+------+

1 Comment

I had to replace collect_list with collect_set to avoid duplicated id values. Otherwise this worked perfectly. Thanks a lot!
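
For completeness, that variant only changes the final aggregation (a minimal sketch reusing df1 and df2 from the answer; note that collect_set also makes no ordering guarantee for the collected ids):

# Deduplicate ids while collecting them back into an array
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_set('id2').alias('refs'))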
