I have a pyspark.sql dataframe that looks like this:
| id | name | refs |
|---|---|---|
| 1 | A | B, C, D |
| 2 | B | A |
| 3 | C | A, B |
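For reference, here's a minimal snippet that reproduces the DataFrame above (`refs` is stored as an array<string> column):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# refs is an array<string> column
df = spark.createDataFrame(
    [
        (1, "A", ["B", "C", "D"]),
        (2, "B", ["A"]),
        (3, "C", ["A", "B"]),
    ],
    ["id", "name", "refs"],
)
```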
I'm trying to build a function that replaces each value in the `refs` array with the `id` of the name it references. If a value has no matching entry in the `name` column, it should ideally be filtered out or set to null. The result would look something like this:
| id | name | refs |
|---|---|---|
| 1 | A | 2, 3 |
| 2 | B | 1 |
| 3 | C | 1, 2 |
I tried doing this with a UDF that collects all names from the table and then looks up the indices of the intersection between each `refs` array and the set of all names. It works, but it is extremely slow; I'm sure there's a better way to do this with Spark and/or SQL.
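Roughly, my attempt looks like this (a simplified sketch rather than my exact code; the real version collects the names into a list and works with positional indices, but the idea is the same):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

# Collect the whole name -> id mapping to the driver
name_to_id = {row["name"]: row["id"] for row in df.select("name", "id").collect()}

# Python UDF: look up each ref, dropping values with no matching name
@F.udf(returnType=ArrayType(LongType()))
def refs_to_ids(refs):
    return [name_to_id[r] for r in refs if r in name_to_id]

result = df.withColumn("refs", refs_to_ids("refs"))
```

This gives the expected output on the small example, but collecting everything to the driver and running a row-by-row Python UDF is, I suspect, what makes it so slow on the real data.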