
I'm reading data from ScyllaDB and want to remove the \n and \r characters from a column. These characters are stored as literal text in the column, and because I'm using Spark SQL I need to do this with REGEXP_REPLACE. The regex pattern that works in MySQL doesn't seem to work here: the string becomes blank, but the characters are not removed. Below is the query I'm using in Spark SQL. Any help is appreciated.

The following string is present in the message column: 'hello\nworld\r'

The expected output is 'hello world'

df=spark.sql("select  REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")
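
A quick way to see why this pattern finds nothing, sketched in plain Python with no Spark involved. It assumes, as the question states, that the column holds the two-character sequences backslash + n and backslash + r rather than real control characters. Python's `re` module stands in for Spark's Java regex engine here, on the assumption that both treat this simple pattern the same way.

```python
import re

# The stored value as described in the question: literal backslash + n / r.
stored = r"hello\nworld\r"

# In a normal (non-raw) Python string, \n and \r are already control
# characters before any regex engine sees the pattern.
pattern = "\n|\r|\r\n"

print(len(stored))                             # 14
print("\n" in stored, "\\n" in stored)         # False True
print(re.sub(pattern, " ", stored) == stored)  # True: nothing was replaced
```

The pattern looks for line-feed and carriage-return characters, while the value contains a backslash followed by a letter, so there is nothing to match.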
  • If you are literally trying to replace the string \n or \r, you need to escape the backslash: regexp_replace(message, '\\n|\\r', ' ') Commented Jul 28, 2022 at 14:05
  • @Andrew, it is not working in PySpark SQL, but it works in a MySQL query. Can you suggest another approach (another function) that I could apply? Commented Jul 29, 2022 at 10:40
  • I also referred to stackoverflow.com/questions/56371701/… and tried it with the dataframe I read from ScyllaDB, but it does not work with that dataframe. When I run the example exactly as given in the link, it works. Can you tell me what the reason might be? Commented Jul 29, 2022 at 11:32
  • Hrm, that's odd. I can't make it work in Spark SQL either. You can do it using regexp_replace and withColumn on the data frame though. You have to use 4 backslashes for each: df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show(). Commented Jul 29, 2022 at 18:19

2 Answers


Thanks to Andrew for the answer.

The following worked for me:

from pyspark.sql.functions import regexp_replace
df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()
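
The four backslashes make more sense when the escaping layers are traced one at a time. The sketch below does that in plain Python, again using `re` as a stand-in for Spark's regex engine. The `sql_unescape` helper is hypothetical and deliberately simplified: it only imitates how, under what I believe is Spark's default setting, the SQL parser collapses doubled backslashes in string literals. The raw-string `query` at the end is an untested suggestion for the `spark.sql()` case, not something confirmed in this thread.

```python
import re

# Value as described in the question: literal backslash + n and backslash + r.
stored = r"hello\nworld\r"

# Layer 1 (Python): the literal "\\\\n|\\\\r" becomes the text \\n|\\r.
df_api_pattern = "\\\\n|\\\\r"
print(df_api_pattern)  # \\n|\\r

# Layer 2 (regex): \\n matches one backslash followed by the letter n.
cleaned = re.sub(df_api_pattern, " ", stored).strip()
print(cleaned)  # hello world


def sql_unescape(literal: str) -> str:
    """Hypothetical, simplified stand-in for SQL string-literal unescaping:
    it only collapses doubled backslashes."""
    return literal.replace("\\\\", "\\")


# Extra layer for spark.sql(): the SQL text needs four backslashes so that
# two remain after the parser's unescaping. A raw string keeps Python's own
# escaping out of the picture.
query = r"select regexp_replace(message, '\\\\n|\\\\r', ' ') as replaced_message from delivery_sms"
sql_literal = r"\\\\n|\\\\r"
print(sql_unescape(sql_literal) == df_api_pattern)  # True
```

Note that replacing the trailing \r with a space leaves a trailing space, so getting exactly 'hello world' also needs a trim (`.strip()` above, or `trim(...)` around the expression in Spark).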

1 Comment

Welcome to SO. The answer is very brief and can be improved by providing more details. It will help others understand. Please read stackoverflow.com/help/how-to-answer.

If anybody is looking for a modern solution to this problem:

from pyspark.sql import functions as F

df.select(
    F.translate(F.col("test"), "\n\r", " " * len("\n\r")).alias("test")
)
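
One caveat worth checking before relying on a character-by-character mapping: it can only replace single characters, so it helps when the column contains real line-feed or carriage-return characters, not the two-character backslash sequences described in the question. The sketch below illustrates the difference with Python's built-in `str.translate`, used here only as an analogy for Spark's `translate`; the sample values are assumptions.

```python
# Map each control character to a space, one character at a time.
table = str.maketrans("\n\r", "  ")

# Case 1: the value holds real control characters, so the mapping applies.
with_controls = "hello\nworld\r"
print(with_controls.translate(table).strip())  # hello world

# Case 2: the value holds literal backslash sequences, so the mapping
# finds no control characters and leaves the text unchanged.
with_literals = r"hello\nworld\r"
print(with_literals.translate(table) == with_literals)  # True
```

If the data looks like the second case, a regex that matches a literal backslash is still needed.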
