
I'm working on a regex to apply to a PySpark DataFrame column.

I can't seem to reproduce in PySpark the result of my regex, which works as intended on regex101.

I've tried several patterns (see below) and none seem to work, even when tested specifically against Java's regex engine. I want to extract the group shown in the regex101 example above.

(\w+(?:\s*|\d*)\s+RUE\s.*)
[\s\-]+(\d*\s*RUE\s+.*)

Code sample:

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([
    ('RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE',)
], ["adresse1"])

df.withColumn("adresse1", regexp_replace("adresse1", r"(\w+(?:\s*|\d*)\s+RUE\s.*)", '$1')).show(truncate=False)

The output I get is my unchanged column:

+-----------------------------------------------+
|adresse1                                       |
+-----------------------------------------------+
|RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE|
+-----------------------------------------------+

whereas I'm expecting the column value to be

81 RUE LOUIS LUMIERE

So far I have no idea what's wrong, especially since my previous patterns matched as expected.
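For what it's worth, the behavior reproduces outside Spark with Python's `re` module (whose semantics agree with Java's for these constructs): the pattern only matches the tail of the string, and a substitution only rewrites the matched portion, so replacing the match with its own group 1 leaves the value unchanged. A quick sketch:

```python
import re

s = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"
pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

# The pattern does match -- but only the tail of the string.
print(re.search(pat, s).group(1))   # 81 RUE LOUIS LUMIERE

# A substitution rewrites only the matched portion, so replacing the
# match with its own group 1 leaves the full string unchanged.
print(re.sub(pat, r"\1", s))        # RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE
```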


Spark config:

  • Version 2.4.0-cdh6.2.0
  • Scala version 2.11.12
  • OpenJDK 64-Bit Server VM, 1.8.0_222
    Side note, you can simplify your pattern to \w+\d*\s+RUE\s.* Commented Nov 20, 2019 at 18:16
  • Indeed I replaced it, thanks! Commented Nov 20, 2019 at 18:42
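The simplification suggested in the comment can be sanity-checked with Python's `re` module (a sketch; a capture group is added to the simplified pattern so both expose the same group):

```python
import re

s = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"

original   = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"
simplified = r"(\w+\d*\s+RUE\s.*)"  # comment's pattern, capture group added

# Both patterns extract the same text on this input.
print(re.search(original, s).group(1))    # 81 RUE LOUIS LUMIERE
print(re.search(simplified, s).group(1))  # 81 RUE LOUIS LUMIERE
```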

1 Answer


I think you should be using regexp_extract instead of regexp_replace:

from pyspark.sql.functions import regexp_extract

df.withColumn(
    "adresse1", 
    regexp_extract("adresse1", r"(\w+(?:\s*|\d*)\s+RUE\s.*)", 1)
).show(truncate=False)
#+--------------------+
#|adresse1            |
#+--------------------+
#|81 RUE LOUIS LUMIERE|
#+--------------------+

To keep the column value unchanged if the pattern doesn't match, you can use pyspark.sql.Column.rlike and when:

from pyspark.sql.functions import col, when

pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

df.withColumn(
    "adresse1", 
    when(
        col("adresse1").rlike(pat), regexp_extract("adresse1", pat, 1)
    ).otherwise(col("adresse1"))
).show(truncate=False)
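The guard matters because regexp_extract yields an empty string when the pattern finds no match. The when/rlike combination expresses the same "extract or keep" fallback as this plain-Python sketch:

```python
import re

pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

def extract_or_keep(value: str) -> str:
    """Return group 1 if the pattern matches, otherwise the original
    value -- the fallback the when/rlike combination expresses in Spark."""
    m = re.search(pat, value)
    return m.group(1) if m else value

print(extract_or_keep("RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"))  # 81 RUE LOUIS LUMIERE
print(extract_or_keep("AVENUE DES CHAMPS"))  # AVENUE DES CHAMPS (unchanged)
```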

6 Comments

Alright, it totally works, thanks for pointing this out. I went through the API documentation too fast and mistook the two of them.
A question comes to mind as I keep advancing on this subject @pault: how can I prevent the row value from being overwritten by an empty value if the regex didn't match? Do I have to use a third column to store the regex extract and then replace the content of my original column only if the regex's output is not empty?
Yes, something like that. Another option would be to use rlike to test if the column matches the pattern first; I've posted an update.
This feels way more straightforward and readable (to me at least). Thanks again for this lightning-fast answer. I'm currently testing it.
Does it have a big impact on the physical schema? Mine doesn't look unusual to me, but it certainly seems to take a significant amount of time, despite still using native PySpark functions.
