How can I use REGEXP_REPLACE in PySpark SQL to remove \n and \r from a column

Question

I'm reading data from ScyllaDB and want to remove the \n and \r characters from a column. The problem is that these characters are stored as literal strings in the column of the table being read, and I need to use REGEXP_REPLACE because I'm using Spark SQL for this. The regex patterns that work in MySQL don't seem to work here: the string becomes blank, but the characters are not removed. Below is the query I'm using in Spark SQL. Any help is appreciated.

The following string is present in the message column: 'hello\nworld\r'

The expected output is 'hello world'

df=spark.sql("select  REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")

If you are literally trying to replace the string \n or \r, you need to escape the backslash: regexp_replace(message, '\\n|\\r', ' ')
– Andrew, Commented Jul 28, 2022 at 14:05
@Andrew, that doesn't work in PySpark SQL, although it works in a MySQL query. Can you suggest another approach (another function) that I could use?
– ash, Commented Jul 29, 2022 at 10:40
I also referred to stackoverflow.com/questions/56371701/… and tried it on the DataFrame I read from ScyllaDB, but it doesn't work with that DataFrame. When I run the example exactly as given in the link, it works. Do you know what the reason might be?
– ash, Commented Jul 29, 2022 at 11:32
Hrm, that's odd. I can't make it work in Spark SQL either. You can do it using regexp_replace and withColumn on the DataFrame, though. You have to use four backslashes for each: df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show().
– Andrew, Commented Jul 29, 2022 at 18:19
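
One possible reason the spark.sql() attempt behaves differently from the DataFrame call is that the pattern is unescaped more than once: first by Python, then (with default parser settings, as far as I know) by the Spark SQL parser, and finally by the regex engine. The sketch below does not run Spark. It only builds two query strings, using the table and column names from the question, and counts the backslashes that would actually reach Spark. The variable names are made up for illustration.

```python
# Normal Python string: Python itself consumes one level of escaping,
# so the SQL text ends up with a single backslash before n and r.
query_plain = (
    "select regexp_replace(message, '\\n|\\r', ' ') "
    "as replaced_message from delivery_sms"
)

# Raw Python string: all four backslashes per token reach the SQL text.
# Assumption: the Spark SQL parser then unescapes '\\\\n' to '\\n',
# which the regex engine reads as a literal backslash followed by n.
query_raw = (
    r"select regexp_replace(message, '\\\\n|\\\\r', ' ') "
    r"as replaced_message from delivery_sms"
)

plain_backslashes = query_plain.count("\\")
raw_backslashes = query_raw.count("\\")
print(plain_backslashes, raw_backslashes)

# With a live SparkSession named `spark` (not created here), one would try:
# df = spark.sql(query_raw)
```

If this reasoning holds for your Spark version, the raw-string query is the one to try with spark.sql(); the plain version hands Spark a pattern that targets real line-feed and carriage-return characters instead.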

ash · Accepted Answer · 2022-08-02 09:00:07Z

0

Thanks to Andrew for the answer.

The following worked for me:

from pyspark.sql.functions import regexp_replace
df.withColumn("test", regexp_replace("_c0", "\\\\n|\\\\r", " ")).show()
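
To see why the escaped pattern matches when the unescaped one does not, here is a minimal sketch that uses Python's own re module as a stand-in for Spark's regex engine (an assumption: the two engines agree on these simple patterns). It uses the sample value from the question, stored as literal backslash sequences.

```python
import re

# The column value holds backslash+n and backslash+r as two-character
# sequences, not real line-feed or carriage-return characters.
message = r"hello\nworld\r"

# In a normal Python string, \n and \r become control characters,
# so this pattern looks for characters the value does not contain.
control_pattern = "\n|\r|\r\n"

# This pattern targets a literal backslash followed by n or r.
literal_pattern = r"\\n|\\r"

unchanged = re.sub(control_pattern, " ", message)
replaced = re.sub(literal_pattern, " ", message)

# The trailing \r also becomes a space, so trim to get the expected output.
trimmed = replaced.strip()
print(repr(unchanged), repr(replaced), repr(trimmed))
```

Note that replacing the trailing \r leaves a space at the end, so a trim step is needed to reach exactly 'hello world'.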

answered Aug 2, 2022 at 9:00

ash


1 Comment

Azhar Khan Over a year ago

Welcome to SO. The answer is very brief and can be improved by providing more details. It will help others understand. Please read stackoverflow.com/help/how-to-answer.

ciurlaro · Accepted Answer · 2022-11-08 14:58:27Z

0

If anybody is looking for a modern solution to this problem:

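from pyspark.sql import functions as F

# Map each real newline or tab character to a space, one character at a time.
df.select(
    F.translate(F.col("test"), "\n\t", " " * len("\n\t")).alias("test")
)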
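
A caveat worth checking before using translate: it works character by character, so it should only help when the column holds real control characters. The sketch below uses plain Python's str.translate as an analogue (it is not Spark) and maps line feed and carriage return, the two characters from the question, to spaces.

```python
# Plain-Python analogue of a per-character translate (not Spark itself).
table = str.maketrans("\n\r", "  ")  # map LF and CR to spaces

real_controls = "hello\nworld\r"   # contains actual LF and CR characters
literal_text = r"hello\nworld\r"   # contains backslash + letter pairs

translated_real = real_controls.translate(table)
translated_literal = literal_text.translate(table)
print(repr(translated_real), repr(translated_literal))
```

The value with literal backslash sequences comes back unchanged, which suggests that the regexp_replace approach is the one to use when the data looks like the example in the question.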

answered Nov 8, 2022 at 14:58

ciurlaro
