Python regex: Getting all numbers besides some which are followed by specific terms

Question

The goal is to get all the numbers from a text except those that are either immediately preceded or followed by specific words/characters, and to ignore dates as well. What I am struggling with is the negative lookbehind.

For example: 4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22

In the sample, the numbers 3, 2, 4.3, 21.6 and the date 11/10/22 should be ignored.

My attempt https://regex101.com/r/PQvtOl/1/

(\d*\b[\.,]?\d+)(?!\d*? (?:wordB))(?!\d*?(?:charA))((?!\b[charB/])(?!\d+))

Any help would be greatly appreciated!

Wiktor Stribiżew · Accepted Answer · 2021-12-05 23:06:57Z


You can use

(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)

Get only those matches that are captured into capturing group #1. See the regex demo. Details:

(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)| - a date-like string: no digit allowed immediately on the left, then one or two digits, /, one or two digits, /, and then two or four digits with no extra digit on the right allowed, or
\b(?:charB|wordA)\s*\d*[.,]?\d+ - a word boundary, then charB or wordA, zero or more whitespace characters, zero or more digits, an optional dot or comma, and one or more digits
| - or (the next part is captured, and re.findall will only output those in the resulting list, the above ones will be discarded)
(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d) - neither a digit nor a digit followed by . or , is allowed immediately on the left; then zero or more digits, an optional . or , and one or more digits are captured into Group 1; finally, the negative lookahead fails the match if wordB or charA appears on the right after zero or more whitespace characters, or if an optional . or , followed by a digit appears immediately on the right.

See the Python demo:

import re

text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
rx = r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)'
# With a single capturing group, re.findall returns Group 1 for every match.
# Matches of the first two alternatives yield empty strings, which are filtered out.
matches = re.findall(rx, text)
print([m for m in matches if m])
# => ['4.5', '55', '1,200']
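
For readability, the same pattern can also be laid out with re.VERBOSE, one alternative per line. This is only a sketch of an equivalent formulation: the names NUMBER_RX and extract_numbers are made up for illustration and are not part of the original demo. It uses re.finditer and checks Group 1 explicitly instead of filtering empty strings.

```python
import re

# The same three alternatives as above, one per line (re.VERBOSE ignores
# unescaped whitespace in the pattern and treats "#" as a comment start).
NUMBER_RX = re.compile(r"""
    (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)   # date-like string: matched, not captured
  | \b(?:charB|wordA)\s*\d*[.,]?\d+                # number after charB or wordA: matched, not captured
  | (?<!\d[.,])(?<!\d)                             # not in the middle of a number
    (\d*[.,]?\d+)                                  # Group 1: the number to keep
    (?!\s*(?:wordB|charA)|[.,]?\d)                 # not followed by wordB, charA or more digits
""", re.VERBOSE)


def extract_numbers(text):
    """Return only the numbers captured in Group 1 (hypothetical helper)."""
    # m.group(1) is None when one of the non-capturing alternatives matched.
    return [m.group(1) for m in NUMBER_RX.finditer(text) if m.group(1)]


text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
print(extract_numbers(text))
# => ['4.5', '55', '1,200']
```

The first two alternatives consume the text that should be ignored, so the third alternative never gets a chance to start inside it.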

edited Dec 5, 2021 at 23:06

answered Dec 5, 2021 at 23:00

Wiktor Stribiżew


4 Comments

jokol Over a year ago

Thanks a lot! This would not work for a large number though, for example 120,990,000.

Wiktor Stribiżew Over a year ago

@jokol This is not about "large numbers", but number formats. There is nothing in your question that would hint at the number formats you expect to handle. Try

(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)(?!\s*(?:wordB|charA)|[.,]?\d)

See this regex demo and this Python demo.
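
A minimal sketch of how the pattern from this comment could be checked in Python. The linked demo is not reproduced here, so the input is an assumption: the question's sample text with 120,990,000 appended.

```python
import re

# Pattern from the comment above: Group 1 now accepts digit groups
# separated by "." or ",", plus an optional fractional part.
rx = re.compile(
    r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)'
    r'|\b(?:charB|wordA)\s*\d*[.,]?\d+'
    r'|(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)'
    r'(?!\s*(?:wordB|charA)|[.,]?\d)'
)

# Sample text from the question with one large number appended (illustrative input).
text = ('4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA '
        'sometext charB21.6 sometext 11/10/22 120,990,000')

numbers = [m for m in rx.findall(text) if m]
print(numbers)
# => ['4.5', '55', '1,200', '120,990,000']
```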

jokol Over a year ago

Thanks! Accepted the answer. I was curious though: what if I wanted to skip all 4-digit numbers (in addition to all the constraints above)? I spent some time and came up with this: (?!\d{4}[:\s\.,])(?<!\d[.,])(?<!\d)((?:\d+(?:[.,]\d{3})*|\d+)(?:\.\d+)?) but to no avail. @Wiktor Stribiżew

Wiktor Stribiżew Over a year ago

@jokol Just add a \d{4} alternative to the first pattern; see this regex demo. The regex will then match these 4-digit chunks without capturing them, so they will be omitted.
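
The exact pattern behind that demo link is not shown here, so the following is only a sketch of one way the suggestion might look. The lookarounds around \d{4} are an assumption, added so that only standalone 4-digit runs are skipped, and the names DATE, FOUR_DIGITS, PREFIXED and KEPT, as well as the input text, are made up for illustration.

```python
import re

DATE = r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)'
# Assumption: a run of exactly four digits, not part of a longer digit run.
FOUR_DIGITS = r'(?<!\d)\d{4}(?!\d)'
PREFIXED = r'\b(?:charB|wordA)\s*\d*[.,]?\d+'
# Only this alternative has a capturing group (Group 1).
KEPT = (r'(?<!\d[.,])(?<!\d)((?:\d{1,3}(?:[.,]\d{3})*|\d+)(?:\.\d+)?)'
        r'(?!\s*(?:wordB|charA)|[.,]?\d)')

# The skipped alternatives come first so they consume the unwanted text.
rx = re.compile('|'.join([DATE, FOUR_DIGITS, PREFIXED, KEPT]))

# Illustrative input, not taken from the original thread.
text = 'in 2021 we sold 1,200 units at 4.5 each, 12345 total, 2 wordB'

numbers = [m for m in rx.findall(text) if m]
print(numbers)
# => ['1,200', '4.5', '12345']
```

Here 2021 is matched by the uncaptured four-digit alternative and dropped, while 12345 is kept because it is a five-digit run.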
