
I'm working on a regex to apply to a PySpark DataFrame column.

I can't seem to reproduce in PySpark the result of my regex, which works as intended on regex101.

I've tried several patterns (see below) and none seem to work, even when tested specifically against Java's regex engine. I want to extract the group shown in the regex101 example above.

(\w+(?:\s*|\d*)\s+RUE\s.*)
[\s\-]+(\d*\s*RUE\s+.*)

Code sample:

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([
    ('RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE',)
], ["adresse1"])

df.withColumn("adresse1", regexp_replace("adresse1", r"(\w+(?:\s*|\d*)\s+RUE\s.*)", '$1')).show(truncate=False)

The output I get is my unchanged column:

+-----------------------------------------------+
|adresse1                                       |
+-----------------------------------------------+
|RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE|
+-----------------------------------------------+

whereas I'm expecting the column value to be

81 RUE LOUIS LUMIERE

So far I have no idea what's wrong, especially since my previous patterns matched as expected.
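For what it's worth, the behavior reproduces outside Spark with Python's `re` module (whose semantics agree with Java's for these constructs): the pattern only matches the tail of the string, and a substitution only rewrites the matched portion, so replacing the match with its own group 1 leaves the value unchanged. A quick sketch:

```python
import re

s = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"
pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

# The pattern does match -- but only the tail of the string.
print(re.search(pat, s).group(1))   # 81 RUE LOUIS LUMIERE

# A substitution rewrites only the matched portion, so replacing the
# match with its own group 1 leaves the full string unchanged.
print(re.sub(pat, r"\1", s))        # RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE
```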


Spark config:

  • Version 2.4.0-cdh6.2.0
  • Scala version 2.11.12
  • OpenJDK 64-Bit Server VM, 1.8.0_222
    Side note, you can simplify your pattern to \w+\d*\s+RUE\s.* Commented Nov 20, 2019 at 18:16
  • Indeed I replaced it, thanks! Commented Nov 20, 2019 at 18:42
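The simplification suggested in the comment can be sanity-checked with Python's `re` module (a sketch; a capture group is added to the simplified pattern so both expose the same group):

```python
import re

s = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"

original   = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"
simplified = r"(\w+\d*\s+RUE\s.*)"  # comment's pattern, capture group added

# Both patterns extract the same text on this input.
print(re.search(original, s).group(1))    # 81 RUE LOUIS LUMIERE
print(re.search(simplified, s).group(1))  # 81 RUE LOUIS LUMIERE
```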

1 Answer


I think you should be using regexp_extract instead of regexp_replace:

from pyspark.sql.functions import regexp_extract

df.withColumn(
    "adresse1", 
    regexp_extract("adresse1", r"(\w+(?:\s*|\d*)\s+RUE\s.*)", 1)
).show(truncate=False)
#+--------------------+
#|adresse1            |
#+--------------------+
#|81 RUE LOUIS LUMIERE|
#+--------------------+

To keep the column value unchanged if the pattern doesn't match, you can use pyspark.sql.Column.rlike and when:

from pyspark.sql.functions import col, when

pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

df.withColumn(
    "adresse1", 
    when(
        col("adresse1").rlike(pat), regexp_extract("adresse1", pat, 1)
    ).otherwise(col("adresse1"))
).show(truncate=False)
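The guard matters because regexp_extract yields an empty string when the pattern finds no match. The when/rlike combination expresses the same "extract or keep" fallback as this plain-Python sketch:

```python
import re

pat = r"(\w+(?:\s*|\d*)\s+RUE\s.*)"

def extract_or_keep(value: str) -> str:
    """Return group 1 if the pattern matches, otherwise the original
    value -- the fallback the when/rlike combination expresses in Spark."""
    m = re.search(pat, value)
    return m.group(1) if m else value

print(extract_or_keep("RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"))  # 81 RUE LOUIS LUMIERE
print(extract_or_keep("AVENUE DES CHAMPS"))  # AVENUE DES CHAMPS (unchanged)
```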

6 Comments

Alright, it totally works, thanks for pointing this out. I went through the API documentation too fast and mistook the two of them.
A question comes to mind as I keep advancing on this subject @pault: how can I prevent the row value from being overwritten by an empty value if the regex didn't match? Do I have to use a third column to store the regex extract and then replace the content of my original column only if the regex's output is not empty?
Yes, something like that. Another option would be to use rlike to test if the column matches the pattern first; I've posted an update.
This feels way more straightforward and readable (to me at least). Thanks again for this lightning-fast answer. I'm currently testing it.
Does it have a big impact on the physical schema? Mine doesn't look unusual to me, but it certainly seems to take a significant amount of time, despite still using native PySpark functions.
