I'm working on a regex to apply to a PySpark DataFrame column.
I can't reproduce in PySpark the result of my regex, which works here on regex101.
I've tried several methods (see below) and none seem to work (I even tested them specifically against Java's regex engine). I want to extract the group shown in my regex101 example.
(\w+(?:\s*|\d*)\s+RUE\s.*)
[\s\-]+(\d*\s*RUE\s+.*)
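As a sanity check outside Spark, the first pattern does match the sample address in plain Python (mirroring the regex101 result). This is only an illustration with Python's `re` module, not Java's engine:

```python
import re

# Sample address used in the DataFrame below
adresse = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"

# First pattern above; re.search finds the leftmost match
m = re.search(r"(\w+(?:\s*|\d*)\s+RUE\s.*)", adresse)
print(m.group(1))  # 81 RUE LOUIS LUMIERE
```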
Code sample:
from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([
    ('RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE',)
], ["adresse1"])
df.withColumn("adresse1", regexp_replace("adresse1", r"(\w+(?:\s*|\d*)\s+RUE\s.*)", '$1')).show(truncate=False)
The output I get is my unchanged column:
+-----------------------------------------------+
|adresse1 |
+-----------------------------------------------+
|RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE|
+-----------------------------------------------+
whereas I expect the column to contain
81 RUE LOUIS LUMIERE
So far I have no idea what's wrong, especially since my previous regexes matched as predicted.
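Interestingly, Python's `re.sub` shows the same behavior: only the matched span is replaced, so substituting the match with its own group 1 leaves the string unchanged. A sketch (again with Python's `re` rather than Java's engine):

```python
import re

adresse = "RESIDENCE LA VENDEENNE 80 81 RUE LOUIS LUMIERE"

# Replacing the match with its own capture group leaves the
# unmatched prefix "RESIDENCE LA VENDEENNE 80 " untouched, so
# the result is identical to the input.
result = re.sub(r"(\w+(?:\s*|\d*)\s+RUE\s.*)", r"\1", adresse)
print(result == adresse)  # True
```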
Spark config:
- Version 2.4.0-cdh6.2.0
- Scala version 2.11.12
- OpenJDK 64-Bit Server VM, 1.8.0_222
A third pattern I tried, also without success: \w+\d*\s+RUE\s.*