
I have a pyspark.sql DataFrame that looks like this:

id  name  refs
1   A     [B, C, D]
2   B     [A]
3   C     [A, B]
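
For reproducibility, the sample DataFrame can be constructed like this (a minimal sketch; refs is assumed to be an array&lt;string&gt; column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; refs is an array<string> column
df = spark.createDataFrame(
    [(1, 'A', ['B', 'C', 'D']),
     (2, 'B', ['A']),
     (3, 'C', ['A', 'B'])],
    ['id', 'name', 'refs'],
)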

I'm trying to build a function that replaces each value in the refs array with the id of the name it references. If a ref has no matching name in the name column, it should ideally be filtered out or set to null. The result would look something like this:

id  name  refs
1   A     [2, 3]
2   B     [1]
3   C     [1, 2]

I tried doing this by defining a UDF that collects all names from the table and then obtains the indices of the intersection between each refs array and the set of all names. It works, but it is extremely slow; I'm sure there are better ways to do this with Spark and/or SQL.
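
For reference, the slow UDF approach probably looked something like this (a hypothetical sketch; name_to_id and resolve_refs are made-up names, and the cost comes from collecting every name to the driver and doing per-row Python lookups):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Hypothetical reconstruction of the slow approach: collect the full
# name -> id mapping to the driver, then look each ref up in a UDF
name_to_id = {row['name']: row['id'] for row in df.select('id', 'name').collect()}

@F.udf(ArrayType(IntegerType()))
def resolve_refs(refs):
    # Keep only refs that match a known name; drop the rest
    return [name_to_id[r] for r in refs if r in name_to_id]

slow_result = df.withColumn('refs', resolve_refs('refs'))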

1 Answer

You can explode the arrays, do a self-join using the exploded ref and name, and collect the joined ids back to an array using collect_list.

import pyspark.sql.functions as F

# One row per referenced name
df1 = df.select('id', 'name', F.explode('refs').alias('refs'))

# Rename the columns to avoid ambiguity in the self-join
df2 = df.toDF('id2', 'name2', 'refs2')

# Join each exploded ref to the name it references; the inner join
# drops refs with no matching name (e.g. D). Then collect the matched
# ids back into an array per (id, name).
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))

result.show()
+---+----+------+
| id|name|  refs|
+---+----+------+
|  1|   A|[2, 3]|
|  2|   B|   [1]|
|  3|   C|[1, 2]|
+---+----+------+

1 Comment

I had to replace collect_list with collect_set to avoid duplicated id values. Otherwise this worked perfectly. Thanks a lot!
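
For completeness, that variant only changes the final aggregation (a minimal sketch reusing df1 and df2 from the answer; note that collect_set also makes no ordering guarantee for the collected ids):

# Deduplicate ids while collecting them back into an array
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_set('id2').alias('refs'))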
