I have a dataframe with a string column containing text of varying lengths, and an array column where each element is a struct holding a word, its index, and its start and end positions in the text column. I want to replace the words in the text column that are listed in the array.
The schema looks like this:
- id: integer
- text: string
- text_entity: array
  - element: struct
    - word: string
    - index: integer
    - start: integer
    - end: integer
text example could be:
"I talked with Christian today at Cafe Heimdal last Wednesday"
text_entity example could be:
[{"word": "Christian", "index":4, "start":14, "end":23}, {"word": "Heimdal", "index":8, "start":38, "end":45}]
I then want to replace the words at the positions given above, so that the text becomes:
"I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday"
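To pin down the semantics, this is roughly what I want expressed in plain Python (not Spark, and not a proposed solution). The function name bleep_entities is made up for illustration, and I am assuming that start is an inclusive and end an exclusive 0-based character offset, which is what the example above suggests:

```python
def bleep_entities(text, entities, replacement="(BLEEP)"):
    """Replace each entity span [start, end) in text with replacement."""
    # Assumption: start is inclusive, end is exclusive, both 0-based.
    # Work from the rightmost span to the leftmost so that the offsets
    # of the remaining spans stay valid after each splice.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + replacement + text[ent["end"]:]
    return text


text = "I talked with Christian today at Cafe Heimdal last Wednesday"
text_entity = [
    {"word": "Christian", "index": 4, "start": 14, "end": 23},
    {"word": "Heimdal", "index": 8, "start": 38, "end": 45},
]
print(bleep_entities(text, text_entity))
```

Because only the offsets are used, a longer word that merely contains an entity (for example "Christiansen") is left untouched unless it is listed in the array itself.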
My initial approach was to explode the array and then apply regexp_replace, but that leaves the problem of collecting the rows for each id and merging them back into a single text, which seems to take a lot of operations. I would also like to avoid UDFs, as performance is quite important. regexp_replace has the further problem that it might match substrings of other words, which would not be acceptable. Ideally, therefore, the replacement should be driven by the index, start, or end values.