
I have a DataFrame with a string column containing text of varying lengths, and an array column where each element is a struct holding a word, its index, and its start and end positions in the text column. I want to replace the words in the text column that are listed in the array.

The schema looks like this:

- id:integer
- text:string
- text_entity:array
  - element:struct
    - word:string
    - index:integer
    - start:integer
    - end:integer

An example text value could be:

"I talked with Christian today at Cafe Heimdal last Wednesday"

An example text_entity value could be:

[{"word": "Christian", "index":4, "start":14, "end":23}, {"word": "Heimdal", "index":8, "start":38, "end":45}]

I then want to replace the words at those positions, so that the text becomes:

"I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday"

My initial approach was to explode the array and then apply regexp_replace, but then the rows have to be collected and merged back into a single text, which seems to take a lot of operations. I would also like to avoid UDFs, as performance is quite important. regexp_replace has the further problem that it might match substrings, which would not be acceptable. Ideally, therefore, the index, start, or end would be used.

2 Answers

Use the aggregate function on the text_entity array, with the text column split into words as the initial value, like this:

from pyspark.sql import functions as F

jsonString = """{"id":1,"text":"I talked with Christian today at Cafe Heimdal last Wednesday","text_entity":[{"word":"Christian","index":4,"start":14,"end":23},{"word":"Heimdal","index":8,"start":38,"end":45}]}"""
df = spark.read.json(spark.sparkContext.parallelize([jsonString]))

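# Fold over the entities, starting from the text split into words; each step
# replaces the word at the entity's index. The index i passed to transform is
# 0-based, while index in the sample data is 1-based, hence x.index - 1.
df1 = df.withColumn(
    "text",
    F.array_join(
        F.expr(r"""aggregate(
                  text_entity,
                  split(text, " "),
                  (acc, x) -> transform(acc, (y, i) -> IF(i = x.index - 1, '(BLEEP)', y))
           )"""),
        " "
    )
)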

df1.show(truncate=False)
#+---+----------------------------------------------------------+----------------------------------------------+
#|id |text                                                      |text_entity                                   |
#+---+----------------------------------------------------------+----------------------------------------------+
#|1  |I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday|[{23, 4, 14, Christian}, {45, 8, 38, Heimdal}]|
#+---+----------------------------------------------------------+----------------------------------------------+

2 Comments

Do you think it's possible to use the position to replace the words instead? The problem here is that the name Christian might appear multiple times in the text, but should only be 'bleeped' in one instance, not all of them.
@cenh Then you can use array positions to replace the element at a given index: first split the text column by spaces to get an array column, then use aggregate. Please see the update above.
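
The position-based replacement asked about in this comment can be sketched in plain Python. This is only an illustration of the logic, not a Spark solution: the function name bleep_by_offsets and its mask parameter are hypothetical, and it assumes that start and end are 0-based character offsets with end exclusive (which matches the sample data) and that the spans do not overlap. The spans are processed from right to left so that replacing one span does not shift the offsets of the spans still to be handled.

```python
def bleep_by_offsets(text, entities, mask="(BLEEP)"):
    """Replace each [start, end) character span of text with mask."""
    # Go from the rightmost span to the leftmost one: a replacement only
    # changes the text to its right, so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + mask + text[ent["end"]:]
    return text


text = "I talked with Christian today at Cafe Heimdal last Wednesday"
entities = [
    {"word": "Christian", "index": 4, "start": 14, "end": 23},
    {"word": "Heimdal", "index": 8, "start": 38, "end": 45},
]
result = bleep_by_offsets(text, entities)
print(result)
# I talked with (BLEEP) today at Cafe (BLEEP) last Wednesday
```

Because only the listed spans are touched, a second occurrence of the same word elsewhere in the text is left alone. The same right-to-left fold might be expressible with Spark's aggregate over the entities sorted by start in descending order, but that variant is untested here.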

I came up with this answer using regexp_replace. The problem with regexp_replace, however, is that it replaces all occurrences, which is not the intention: a word could appear multiple times in the text, and only some of the occurrences should be bleeped.

# Build a word-boundary regex from the distinct entity words, then replace every match.
df = df.withColumn("temp_entities", F.expr("transform(text_entity, (x, i) -> x.word)")) \
    .withColumn("temp_entities", F.array_distinct("temp_entities")) \
    .withColumn("regex_expression", F.concat_ws("|", "temp_entities")) \
    .withColumn("regex_expression", F.concat(F.lit("\\b("), F.col("regex_expression"), F.lit(")\\b"))) \
    .withColumn("text", F.when(F.size("text_entity") > 0, F.expr("regexp_replace(text, regex_expression, '(BLEEP)')")).otherwise(F.col("text")))

It removes duplicates and only applies regexp_replace if there is at least one entity. It is probably not the most elegant solution, and it will bleep all occurrences of the word. Ideally the position should be used.
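
The behaviour described above can be reproduced in plain Python as a small sketch. It uses Python's re module rather than Spark's regexp_replace, so it only illustrates the pattern logic; the function name bleep_all_occurrences is hypothetical, and escaping the words with re.escape is an addition that the answer's code does not perform.

```python
import re


def bleep_all_occurrences(text, words, mask="(BLEEP)"):
    """Replace every whole-word occurrence of any of the words with mask."""
    if not words:
        # Mirrors the size(text_entity) > 0 guard: nothing to replace.
        return text
    # dict.fromkeys drops duplicate words while keeping their order.
    alternation = "|".join(re.escape(w) for w in dict.fromkeys(words))
    pattern = r"\b(?:" + alternation + r")\b"
    return re.sub(pattern, mask, text)


print(bleep_all_occurrences("Christian met Christiansen and Christian", ["Christian"]))
# (BLEEP) met Christiansen and (BLEEP)
```

The word boundaries keep "Christiansen" intact, but both standalone occurrences of "Christian" are replaced, which is exactly the limitation noted above.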
