4

Hello people of StackOverflow,
I'm trying to extract the numeric part of a house number, but for some reason I cannot.
I have working code from Teradata that I'm trying to convert to PySpark:

--Original code from teradata:
CAST(REGEXP_SUBSTR(adr_house_no, '\d+') AS INTEGER) AS adr_house_no, 
REGEXP_SUBSTR(adr_house_no, '[A-Za-z]+$') AS adr_house_no_ad

Here is the query I'm using:

result = spark.sql('''

    SELECT
        adr_house_no as house_no,
        CAST(regexp_extract(adr_house_no, '(\d+)') AS INT) as adr_house_no,
        regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
    FROM subscriber_info_address_subscriber

    ''').show()

The result is as follows:

+--------+------------+---------------+
|house_no|adr_house_no|adr_house_no_ad|
+--------+------------+---------------+
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     51l|        null|              l|
|     84J|        null|              J|
|     96t|        null|              t|
|     919|        null|               |
|     59e|        null|              e|
|     919|        null|               |
| LTECXYD|        null|        LTECXYD|
|     67s|        null|              s|
|     4-6|        null|               |
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     842|        null|               |
| LTECXYD|        null|        LTECXYD|
|     98r|        null|              r|
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
+--------+------------+---------------+

Extracting the house letter works, but for some reason I cannot match any digits. I tried matching a single digit with \d, and also two.
I tried regexp_extract(adr_house_no, '\d+') without parentheses, but that doesn't work either.
What does work is regexp_extract(adr_house_no, '[0-9]+').
Why is that? Why doesn't \d work in PySpark?

2
  • Because regexp_extract(adr_house_no, '\\d+') does? Commented Jan 29, 2020 at 15:20
  • It also doesn't work :/ Commented Jan 29, 2020 at 15:24

3 Answers

6

Actually, \d is supported in Spark SQL, but you need to put r before the Python string (making it a raw string) and double the backslash, for example:

result = spark.sql(r'''

    SELECT
        adr_house_no as house_no,
        CAST(regexp_extract(adr_house_no, '(\\d+)') AS INT) as adr_house_no,
        regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
    FROM subscriber_info_address_subscriber

    ''').show()
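
One way to see why the doubled backslash is needed: the pattern passes through two layers before it reaches the regex engine. Python interprets the string literal first, and then the Spark SQL parser unescapes the quoted SQL literal (this is the default behaviour, unless spark.sql.parser.escapedStringLiterals is enabled). The sketch below models the second layer with a deliberately simplified, hypothetical helper (unescape_sql_literal is not a Spark function) and uses Python's re module as a stand-in for the Java regex engine that Spark uses.

```python
import re


def unescape_sql_literal(text: str) -> str:
    """Simplified, hypothetical model of SQL literal unescaping:
    a backslash is dropped and the character after it is kept."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # keep the escaped character only
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def first_match(pattern: str, value: str) -> str:
    """Return the first match of pattern in value, or '' if there is none."""
    match = re.search(pattern, value)
    return match.group(0) if match else ""


# What the SQL parser receives for each way of writing the Python string.
single = r"(\d+)"          # from '''... '(\d+)' ...''' (one backslash)
python_doubled = "(\\d+)"  # non-raw '\\d': Python already collapses it to one backslash
raw_doubled = r"(\\d+)"    # from r'''... '(\\d+)' ...''' (two backslashes survive)

print(unescape_sql_literal(single))       # (d+)
print(unescape_sql_literal(raw_doubled))  # (\d+)

# Only the raw, doubled form still contains \d once it reaches the regex engine.
print(first_match(unescape_sql_literal(single), "51l"))       # prints an empty line
print(first_match(unescape_sql_literal(raw_doubled), "51l"))  # 51
```

Under this model, a non-raw Python string with '\\d' ends up identical to the single-backslash version, which would explain why the suggestion from the comments did not help.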

1

Hi, I have figured out the problem.

Because the query is written as SQL, a single \d in the SQL string does not reach the regex engine as written, so nothing is matched. One way around this is to use '[0-9]+' to match digits.

In your case, replace the query with:

spark.sql("SELECT adr_house_no as house_no, CAST(regexp_extract(adr_house_no, '([0-9]+)',1) AS INT) as adr_house_no, regexp_extract(adr_house_no, '([A-Za-z]+$)',1) as adr_house_no_ad FROM subscriber_info_address_subscriber").show()

Alternatively, if you want to keep \d, use the DataFrame API, where the pattern is passed directly rather than parsed as a SQL string literal (a raw Python string keeps the backslash intact):

from pyspark.sql.functions import regexp_extract

df.withColumn('house_no', regexp_extract('adr_house_no', r'(\d+)', 1).cast('int')).withColumn('adr_house_no_ad', regexp_extract('adr_house_no', '([A-Za-z]+$)', 1)).show()

2 Comments

Hi Aleksander Lipka, please let me know whether it worked for you.
Thank you for your interest in the problem and for the solution. As I wrote, I knew that '([0-9]+)' would work. Do you know why \d is not supported in SQL format? It's actually quite interesting why it isn't.
0

In a regular expression, parentheses indicate a group, so you also have to specify the number of the group you want to extract. Group numbers start at one. If your pattern contains 3 groups and you need the 2nd one, pass 2.

In your case there is one group and you need that one, so write regexp_extract('adr_house_no', '(\d+)',1).

Also note the signature regexp_extract(str, pattern, idx): it extracts a specific group (idx) matched by a Java regex from the specified string column.
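
To illustrate how the group index selects part of a match, here is a minimal sketch using Python's re module as a stand-in for the Java regex engine (an assumption for illustration; the two behave the same for this simple pattern). The helper extract_group is hypothetical and only imitates the shape of regexp_extract(str, pattern, idx), returning an empty string when nothing matches.

```python
import re


def extract_group(value: str, pattern: str, idx: int) -> str:
    """Hypothetical stand-in for regexp_extract(str, pattern, idx).
    Returns '' when the pattern does not match or the group is empty."""
    match = re.search(pattern, value)
    if match is None:
        return ""
    return match.group(idx) or ""


# Two groups: group 1 is the digits, group 2 is any letters right after them.
pattern = r"(\d+)([A-Za-z]*)"

print(extract_group("51l", pattern, 1))  # 51
print(extract_group("51l", pattern, 2))  # l
print(extract_group("51l", pattern, 0))  # 51l (index 0 is the whole match)
```

Index 0 refers to the entire match, while 1 and above refer to the parenthesised groups in order.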

1 Comment

This is the first thing I tried; the idx argument is optional. It still doesn't work. The problem is with \d itself.
