The goal is to extract all the numbers from a text except those that are immediately preceded or followed by specific words/characters (dates should be ignored as well). What I am struggling with is the negative lookbehind.

For example: 4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22

In this sample, the numbers 3, 2, 4.3 and 21.6 and the date 11/10/22 should be ignored.

My attempt https://regex101.com/r/PQvtOl/1/

(\d*\b[\.,]?\d+)(?!\d*? (?:wordB))(?!\d*?(?:charA))((?!\b[charB/])(?!\d+))

Any help would be greatly appreciated!

1 Answer

You can use

(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)

Get only those matches that are captured into capturing group #1. See the regex demo. Details:

  • (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)| - a date-like string: no digit allowed immediately on the left, then one or two digits, /, one or two digits, /, and then two or four digits with no extra digit on the right allowed, or
  • \b(?:charB|wordA)\s*\d*[.,]?\d+ - a word boundary, then charB or wordA, zero or more whitespaces, zero or more digits, an optional dot or comma, one or more digits
  • | - or (the next part is captured, and re.findall will only output those in the resulting list, the above ones will be discarded)
  • (?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d) - neither a digit nor a digit followed by . or , is allowed immediately on the left; then zero or more digits, an optional . or , and one or more digits are captured into Group 1; finally, the negative lookahead fails the match if, immediately on the right, there is either wordB or charA after zero or more whitespace characters, or an optional . or , followed by a digit.

See the Python demo:

import re
text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
rx = r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)'
matches = re.findall(rx, text)
print( [ m for m in matches if m ] )
# => ['4.5', '55', '1,200']
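The same "match what you want to discard, capture what you want to keep" idea can also be sketched with re.finditer and a verbose pattern, which makes it easier to see which alternative does what. This is only an illustrative variant of the demo above; the names NUMBER_RX and extract_numbers are made up for this example.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE.
# Only the last alternative has a capturing group.
NUMBER_RX = re.compile(r"""
    (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)   # date-like string: matched, not captured
  | \b(?:charB|wordA)\s*\d*[.,]?\d+                # number after charB/wordA: matched, not captured
  | (?<!\d[.,])(?<!\d)(\d*[.,]?\d+)                # wanted number: captured into group 1
    (?!\s*(?:wordB|charA)|[.,]?\d)                 # rejected if wordB/charA or more digits follow
""", re.VERBOSE)

def extract_numbers(text):
    """Return only the matches where group 1 took part in the match."""
    return [m.group(1) for m in NUMBER_RX.finditer(text) if m.group(1) is not None]

text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
print(extract_numbers(text))
# => ['4.5', '55', '1,200']
```

Checking m.group(1) against None distinguishes "group 1 did not participate" from an empty capture, which re.findall reports identically as ''.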

4 Comments

Thanks a lot! this would not work for a large number though, for example 120,990,000.
@jokol This is not about "large numbers", but about number formats. There is nothing in your question that would hint at the number formats you expect to handle. Try (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)(?!\s*(?:wordB|charA)|[.,]?\d), see this regex demo and this Python demo.
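For reference, here is a minimal sketch of how the pattern from the comment above behaves in Python. The sample text is the one from the question with 120,990,000 prepended; the variable names are only illustrative.

```python
import re

# Pattern from the comment: group 1 now accepts digit groups separated by . or ,
# (e.g. 120,990,000) or a plain run of digits, plus an optional decimal part.
rx = (r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)'
      r'|\b(?:charB|wordA)\s*\d*[.,]?\d+'
      r'|(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)'
      r'(?!\s*(?:wordB|charA)|[.,]?\d)')

text = ('120,990,000 4.5 $55 1,200 wordA 3 sometext 2 wordB sometext '
        '4.3charA sometext charB21.6 sometext 11/10/22')

# re.findall returns '' for matches where group 1 did not participate; drop those.
numbers = [m for m in re.findall(rx, text) if m]
print(numbers)
# => ['120,990,000', '4.5', '55', '1,200']
```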
Thanks! Accepted the answer. I was curious, though: what if I wanted to skip all 4-digit numbers (in addition to all the constraints above)? I spent some time and came up with (?!\d{4}[:\s\.,])(?<!\d[.,])(?<!\d)((?:\d+(?:[.,]\d{3})*|\d+)(?:\.\d+)?) but to no avail. @Wiktor Stribiżew
@jokol Just add a \d{4} alternative to the first pattern, see this regex demo. It will make the regex match these 4-digit chunks without capturing them, so they will be omitted from the result.
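A rough sketch of that suggestion follows. The exact pattern behind the linked demo is not shown in the comment, so the extra alternative here is an assumption: it is guarded with lookarounds so that only standalone 4-digit numbers are discarded, not four digits inside a longer number. The sample text is made up for illustration.

```python
import re

# Assumed form of the extra alternative: exactly four digits, with no digit, dot,
# or comma right before, and no further digit (optionally after . or ,) right after.
four_digits = r'(?<![\d.,])\d{4}(?![.,]?\d)'

# The pattern from the earlier comment, unchanged.
base = (r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)'
        r'|\b(?:charB|wordA)\s*\d*[.,]?\d+'
        r'|(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)'
        r'(?!\s*(?:wordB|charA)|[.,]?\d)')

# The new alternative goes first, so 4-digit numbers are consumed without a capture.
rx = four_digits + '|' + base

text = 'in 2022 we sold 1,200 units and 12345 parts for 4.5 each'
numbers = [m for m in re.findall(rx, text) if m]
print(numbers)
# => ['1,200', '12345', '4.5']
```

A bare \d{4} alternative would also consume the first four digits of a longer run such as 12345, which is why this sketch adds the lookarounds.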
