Hello people of StackOverflow
I'm trying to extract the numeric part of a house number column, but for some reason I cannot.
I have working code from Teradata that I'm trying to convert to PySpark:
--Original code from teradata:
CAST(REGEXP_SUBSTR(adr_house_no, '\d+') AS INTEGER) AS adr_house_no,
REGEXP_SUBSTR(adr_house_no, '[A-Za-z]+$') AS adr_house_no_ad
Here is the query I'm using:
result = spark.sql('''
SELECT
adr_house_no as house_no,
CAST(regexp_extract(adr_house_no, '(\d+)') AS INT) as adr_house_no,
regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
FROM subscriber_info_address_subscriber
''')
result.show()
The result is as follows:
+--------+------------+---------------+
|house_no|adr_house_no|adr_house_no_ad|
+--------+------------+---------------+
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     51l|        null|              l|
|     84J|        null|              J|
|     96t|        null|              t|
|     919|        null|               |
|     59e|        null|              e|
|     919|        null|               |
| LTECXYD|        null|        LTECXYD|
|     67s|        null|              s|
|     4-6|        null|               |
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     842|        null|               |
| LTECXYD|        null|        LTECXYD|
|     98r|        null|              r|
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
+--------+------------+---------------+
Extracting the house letter works, but for some reason I cannot match any digits. I tried matching a single digit with \d, and also two.
I also tried regexp_extract(adr_house_no, '\d+') without parentheses, but that doesn't work either.
What does work is regexp_extract(adr_house_no, '[0-9]+').
Why is that? Why doesn't \d work in PySpark, while regexp_extract(adr_house_no, '\\d+') does?
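
A likely explanation, sketched without Spark: by default the Spark SQL parser unescapes backslashes inside string literals before the regex engine sees the pattern, so '\d+' arrives as the regex d+ and matches nothing, while [0-9]+ contains no backslash and passes through unchanged. The helpers spark_unescape and extract_like_spark below are hypothetical stand-ins written for illustration, not Spark APIs, and they use Python's re module rather than the Java regex engine Spark uses.

```python
import re


def spark_unescape(literal_body: str) -> str:
    # Simplified stand-in (an assumption, not Spark's real parser):
    # a backslash followed by any character collapses to that character.
    # Special escapes such as \n or \t are deliberately ignored here.
    return re.sub(r"\\(.)", r"\1", literal_body)


def extract_like_spark(literal_body: str, value: str) -> str:
    # Mimics regexp_extract returning an empty string when nothing matches.
    pattern = spark_unescape(literal_body)
    match = re.search(pattern, value)
    return match.group(0) if match else ""


# Raw strings show exactly the characters that reach the SQL parser.
for literal_body in (r"\d+", r"\\d+", "[0-9]+"):
    print(
        repr(literal_body),
        "-> regex", repr(spark_unescape(literal_body)),
        "-> extracted", repr(extract_like_spark(literal_body, "51l")),
    )
```

Note that a non-raw Python string applies its own unescaping first: '\\d+' inside a plain ''' ... ''' query reaches the SQL parser as \d+, so the query would have to be a raw string (r''' ... ''') or use four backslashes for the doubled form to arrive intact. Setting spark.sql.parser.escapedStringLiterals to true is another option, since it makes the parser keep backslashes in string literals as written.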