Regex for a 9-digit number that is not preceded by a country-code-like prefix [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.


Closed 5 years ago.


I'm trying to filter out potential Citizen Service Numbers (BSN in Dutch) in specific texts which are also full of Dutch phone numbers. The phone numbers start with the +31 country code, while BSN numbers do not.

Could someone help me come up with a regular expression that matches any 9-digit number that is not preceded by +<country-code-like-prefix><space>?

For example, in the sentence:

The number is +31 713176319 and 650068168 is another one.

I'd like to extract 650068168, but not 713176319. This might be solved by negative lookahead, but I was not able to find the right solution.

According to the country code list at countrycode.org, some country codes look like +1-xxx. For example, Anguilla has the code 1-264. Shouldn't the regex also handle a 9-digit number written as +1 264 xxxxxxxxx?
– Thân LƯƠNG Đình, Commented Sep 26, 2020 at 10:15
Thank you, @ThanLUONG. I added more details to my question to explain that it is specific to Dutch.
– Mykola, Commented Sep 27, 2020 at 7:13

41686d6564 · Accepted Answer · 2020-09-26 08:54:28Z

1

Use a negative lookbehind:

(?<!\+\d\d )\b\d{9}\b

This ensures that the 9-digit number is not preceded by ("+" followed by two digits followed by a space character).
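As a quick check, the pattern above can be run against the sentence from the question. This is a minimal sketch; the names BSN_CANDIDATE, text, and candidates are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer: a standalone 9-digit number that is NOT
# preceded by "+", exactly two digits, and a space.
BSN_CANDIDATE = re.compile(r'(?<!\+\d\d )\b\d{9}\b')

text = "The number is +31 713176319 and 650068168 is another one."

# findall returns every non-overlapping match as a list of strings.
candidates = BSN_CANDIDATE.findall(text)
print(candidates)  # ['650068168']
```

The phone number is rejected because the four characters before it are "+31 ", which is exactly what the lookbehind forbids.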


Note that this only works when the country code has two digits, as in your example. Supporting country codes with one or three digits is a little trickier, because Python's re module doesn't support variable-width lookbehinds. You could, however, chain several fixed-width lookbehinds like this:

(?<!\+\d )(?<!\+\d{2} )(?<!\+\d{3} )\b\d{9}\b
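The chained lookbehinds can be exercised with prefixes of each length. The following is a sketch under the assumption that the prefix is always "+", one to three digits, and a single space; the helper find_bsn_candidates and the sample strings are made up for illustration.

```python
import re

# Three fixed-width negative lookbehinds, one per prefix length (1-3 digits).
PATTERN = re.compile(r'(?<!\+\d )(?<!\+\d{2} )(?<!\+\d{3} )\b\d{9}\b')


def find_bsn_candidates(text):
    """Return 9-digit numbers not preceded by '+<1-3 digits><space>'."""
    return PATTERN.findall(text)


print(find_bsn_candidates("+1 234567890"))        # []
print(find_bsn_candidates("+31 713176319"))       # []
print(find_bsn_candidates("+351 912345678"))      # []
print(find_bsn_candidates("call 650068168 now"))  # ['650068168']
```

A prefix written as two groups, such as "+1 264 123456789", is not covered by these lookbehinds, so the number after it would still be returned.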


edited Sep 26, 2020 at 8:54

answered Sep 26, 2020 at 8:46


1 Comment

Mykola Over a year ago

Wow, I knew about the lookahead feature but never thought there was also a lookbehind. Thank you so much.

Tim Biegeleisen · Accepted Answer · 2020-09-26 09:02:53Z

1

I suggest using re.findall here:

import re

inp = "The number is +31 713176319 and 650068168 is another one."
matches = re.findall(r'(?:^|(?<!\S)(?!\+\d+)\S+ )(\d{9})\b', inp)
print(matches)

This prints:

['650068168']

The regex strategy here is to match a 9 digit standalone number when either it appears at the very start of the string, or it is preceded by some "word" (word being loosely defined as \S+ here) which is not a country code prefix.

Here is an explanation of the regex used:

(?:
    ^          from the start of the string
    |          OR
    (?<!\S)    assert that what precedes is whitespace or start of the string
    (?!\+\d+)  assert that what follows is NOT a country code prefix
    \S+        match the non prefix "word", followed by a space
)
(\d{9})        match and capture the 9 digit number
\b             word boundary
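The annotated breakdown above can also be kept inside the pattern itself by compiling it with re.VERBOSE. This is only a sketch of that idea, not part of the original answer; the name STANDALONE_NUMBER is illustrative, and the literal space is written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

STANDALONE_NUMBER = re.compile(r"""
    (?:
        ^            # at the very start of the string
        |            # OR
        (?<!\S)      # preceded by whitespace or the start of the string
        (?!\+\d+)    # the next "word" is not a country-code prefix
        \S+[ ]       # consume that word and the single space after it
    )
    (\d{9})          # capture the 9-digit number
    \b               # word boundary after the number
""", re.VERBOSE)

inp = "The number is +31 713176319 and 650068168 is another one."
matches = STANDALONE_NUMBER.findall(inp)
print(matches)  # ['650068168']
```

Because the preceding word is consumed as part of the match, two bare numbers in a row (for example "111111111 222222222") yield only the first one with this approach.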

edited Sep 26, 2020 at 9:02

answered Sep 26, 2020 at 8:56


1 Comment

Mykola Over a year ago

Thanks for explaining this, I learned a lot about look behind.
