
I'm trying to filter out potential Citizen Service Numbers (BSN in Dutch) from specific texts that are also full of Dutch phone numbers. The phone numbers start with the +31 country code, while BSNs do not.

Could someone help me come up with a regular expression that matches any 9-digit number that is not preceded by +<country-code-like-prefix><space>?

For example, in the sentence:

The number is +31 713176319 and 650068168 is another one.

I'd like to extract 650068168, but not 713176319. This might be solved by negative lookahead, but I was not able to find the right solution.

2
  • According to the country code list at countrycode.org, some country codes have the form +1-xxx. For example, Anguilla has the code 1-264. Shouldn't the regex also handle a 9-digit number written as +1 264 xxxxxxxxx? Commented Sep 26, 2020 at 10:15
  • Thank you, @ThanLUONG. I added more details to my question to explain that it is specific to Dutch numbers. Commented Sep 27, 2020 at 7:13

2 Answers

1

Use a negative Lookbehind:

(?<!\+\d\d )\b\d{9}\b

This ensures that the 9-digit number is not preceded by "+", two digits, and a space character.
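As a quick check, here is a minimal sketch that runs this pattern against the sentence from the question. The variable names `text`, `pattern`, and `matches` are only illustrative.

```python
import re

# Sample sentence taken from the question.
text = "The number is +31 713176319 and 650068168 is another one."

# Negative lookbehind: skip a 9-digit run that is preceded by
# "+", exactly two digits, and a space.
pattern = re.compile(r"(?<!\+\d\d )\b\d{9}\b")

matches = pattern.findall(text)
print(matches)  # ['650068168']
```

Because the lookbehind has a fixed width of four characters, Python's `re` module accepts it without complaint.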


Note that this only works when the country code has two digits, as in your example. Supporting country codes with one or three digits is a little trickier, because Python's built-in re module doesn't support variable-width lookbehinds. You can, however, chain several fixed-width lookbehinds like this:

(?<!\+\d )(?<!\+\d{2} )(?<!\+\d{3} )\b\d{9}\b
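A small sketch of how the chained lookbehinds behave for prefixes of different lengths. The sample strings other than the Dutch one are made up for illustration and are not real phone numbers.

```python
import re

# Three fixed-width negative lookbehinds, one per prefix length (1 to 3 digits).
pattern = re.compile(r"(?<!\+\d )(?<!\+\d{2} )(?<!\+\d{3} )\b\d{9}\b")

# Illustrative inputs; only the last one has no "+digits " prefix.
samples = [
    "+1 123456789",    # one-digit prefix, rejected
    "+31 713176319",   # two-digit prefix, rejected
    "+351 912345678",  # three-digit prefix, rejected
    "BSN 650068168",   # no prefix, accepted
]

results = [pattern.findall(s) for s in samples]
print(results)  # [[], [], [], ['650068168']]
```

Each lookbehind is checked independently at the same position, so a candidate is kept only if none of the three prefix shapes appears directly before it.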



1 Comment

Wow, I knew about the lookahead feature but never thought there was also a lookbehind. Thank you so much.
1

I suggest using re.findall here:

import re

inp = "The number is +31 713176319 and 650068168 is another one."
matches = re.findall(r'(?:^|(?<!\S)(?!\+\d+)\S+ )(\d{9})\b', inp)
print(matches)

This prints:

['650068168']

The regex strategy here is to match a 9 digit standalone number when either it appears at the very start of the string, or it is preceded by some "word" (word being loosely defined as \S+ here) which is not a country code prefix.

Here is an explanation of the regex used:

(?:
    ^          from the start of the string
    |          OR
    (?<!\S)    assert that what precedes is whitespace or start of the string
    (?!\+\d+)  assert that what follows is NOT a country code prefix
    \S+        match the non prefix "word", followed by a space
)
(\d{9})        match and capture the 9 digit number
\b             word boundary
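The same breakdown can be written as a commented pattern using `re.VERBOSE`. This is a sketch and not part of the original answer: the literal space has to be written as `[ ]`, because verbose mode ignores bare whitespace in the pattern.

```python
import re

pattern = re.compile(
    r"""
    (?:
        ^              # from the start of the string
        |              # OR
        (?<!\S)        # preceded by whitespace or the start of the string
        (?!\+\d+)      # the upcoming word is not a country code prefix
        \S+[ ]         # consume that word and the single space after it
    )
    (\d{9})            # capture the 9-digit number
    \b                 # word boundary
    """,
    re.VERBOSE,
)

text = "The number is +31 713176319 and 650068168 is another one."
matches = pattern.findall(text)
print(matches)  # ['650068168']
```

One caveat with this approach: the preceding word is consumed as part of the match, so when two 9-digit numbers sit next to each other (for example "and 650068168 123456789"), only the first one is captured.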

1 Comment

Thanks for explaining this, I learned a lot about look behind.
