Regex to extract both integer or float values followed by a unit, Python

Question

I want to build a regex that captures every pattern in a string where an integer or a floating-point number appears before a unit of measurement (ml, mg, kg, etc.). My current regex only considers integers and breaks when there's a space between the number and the unit. I want to handle both cases in my code.

import re

p = re.compile('[0-9](?:mg|kg|ml|q.s.|ui|M|g|µg)')
x = '0.9mg is the approximate dosage'
z = p.findall(x)
print(z)  # prints ['9mg']: only the last digit is picked up

This doesn't work for decimals and also breaks when there's a space.

Expected patterns to be captured are:

Examples: 0.9 mg, 9 mg, 9mg, 0.9mg

Any help with this would be appreciated.

Using the regex in the code:

mg = []
newregex = r"[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)" 
for s in zz:
    for e in extracteddata:
        v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE)
        if v:
            mg.append(v.group(0))

game0ver · Accepted Answer · 2019-12-05 19:47:56Z

6

You can try with this:

([.\d]+)\s*(?:mg|kg|ml|q.s.|ui|M|g|µg)
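One thing to keep in mind with this pattern: because it contains a capturing group, re.findall returns only the captured number, not the number together with its unit. A minimal sketch of the difference, using the example values from the question (the variable names are illustrative, not from the original post):

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"([.\d]+)\s*(?:mg|kg|ml|q.s.|ui|M|g|µg)")
text = "Examples: 0.9 mg, 9 mg, 9mg, 0.9mg"

# With exactly one capturing group, findall returns just that group.
numbers = pattern.findall(text)
print(numbers)        # ['0.9', '9', '9', '0.9']

# finditer exposes the whole match (number and unit) via group(0).
full_matches = [m.group(0) for m in pattern.finditer(text)]
print(full_matches)   # ['0.9 mg', '9 mg', '9mg', '0.9mg']
```

Note that [.\d]+ accepts any run of digits and dots, so a string such as 1.2.3 would also be treated as a number; a stricter number pattern is needed if that matters.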


answered Dec 5, 2019 at 19:47


5 Comments

techsmart Over a year ago

Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.

game0ver Over a year ago

The actual number extracted from the string can be acquired with .group(1). So try mg.append(re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE).group(1)). It would also be a good idea to move the pattern outside the loop so it doesn't have to be compiled every single time.

techsmart Over a year ago

I tried it but it's not working. I have updated the code with some changes and am now getting "TypeError: expected string or bytes-like object" on the line v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE). Any idea what causes that?

game0ver Over a year ago

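Use .group(1) rather than .group(0): with this pattern, .group(0) is the whole match (number and unit), while .group(1) is just the number.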

techsmart Over a year ago

Are you talking about the previous code I posted, the one with mg.append(re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE))?

Nick Reed · Accepted Answer · 2019-12-05 19:52:27Z

3

(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)

This regex will select properly formatted numbers with properly formatted units after them, and will reject badly formed numbers or non-existent units.

(?<!\d|\.) - make sure there are no digits or decimal points immediately before this number.
\d+ - match one or more digits.
(?:\.\d+)? - optionally match a decimal point followed by one or more digits.
\s*? - match zero or more whitespace characters, as few as possible.
(?:mg|kg|ml|q\.s\.|ui|M|g|µg) - match one of the listed units (non-capturing group).
(?!\w) - make sure no word character (letter, digit, or underscore) directly follows the unit.
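To see what "reject badly formed numbers or non-existent units" means in practice, here is a small sketch that runs the same pattern against a few inputs. The sample strings are made up for illustration and are not from the original post:

```python
import re

pattern = re.compile(
    r"(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)"
)

# Hypothetical inputs: one well-formed value and several that should fail.
samples = ["0.9 mg", "9mgs", "5 grams", "1.2.3 mg", ".5 mg"]
results = {s: pattern.findall(s) for s in samples}

for sample, found in results.items():
    print(f"{sample!r}: {found}")
# '0.9 mg': ['0.9 mg']
# '9mgs': []       (unit followed by an extra letter)
# '5 grams': []    ("grams" is not a listed unit)
# '1.2.3 mg': []   (more than one decimal point)
# '.5 mg': []      (no digit before the decimal point)
```

The last case shows a trade-off: values written without a leading zero are rejected too, so the number part would need to be loosened if such input is expected.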


import re

# Raw string, so backslashes such as \d reach the regex engine unchanged.
p = re.compile(r'(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)')
x = 'Examples: 0.9 mg, 9 mg, 9mg, 0.9mg'

print(p.findall(x))

['0.9 mg', '9 mg', '9mg', '0.9mg']


edited Dec 5, 2019 at 19:52

answered Dec 5, 2019 at 19:46


2 Comments

techsmart Over a year ago

Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.

Nick Reed Over a year ago

@techsmart That error is saying you gotta put the re.IGNORECASE|re.MULTILINE flags into your re.compile statement, not your re.search statement.
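A minimal sketch of that fix, assuming the extracted data is a list of strings (extracted_lines and its contents are hypothetical stand-ins for the asker's data). The flags are passed once to re.compile, and each string is searched individually. Passing the whole list to re.search instead of a single string is a likely cause of the "expected string or bytes-like object" TypeError mentioned elsewhere in this thread.

```python
import re

# Flags are given once, at compile time.
unit_pattern = re.compile(
    r"(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)",
    flags=re.IGNORECASE,
)

# Hypothetical stand-in for the asker's extracted text: a list of strings.
extracted_lines = [
    "0.9 MG is the approximate dosage",
    "no dosage mentioned here",
    "take 9mg daily",
]

first_matches = []
for line in extracted_lines:
    # Search each string, not the list itself; no flags argument here.
    match = unit_pattern.search(line)
    if match:
        first_matches.append(match.group(0))

print(first_matches)  # ['0.9 MG', '9mg']
```

re.MULTILINE is left out here because it only changes how ^ and $ behave, and this pattern uses neither.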

Arkistarvh Kltzuonstev · Accepted Answer · 2019-12-05 19:49:19Z

2

Try this :

import re

x = '9mg 9.0mg  0 mg .009 mg is the approximate dosage'
# Raw string, so the backslashes reach the regex engine unchanged.
p = re.compile(r'[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)')
p.findall(x)

output :

['9mg', ' 9.0mg', '  0 mg', ' .009 mg']

edited Dec 5, 2019 at 19:49

answered Dec 5, 2019 at 19:37


4 Comments

techsmart Over a year ago

It's not working if there's a space between 0.9 and mg in the string. Example: "x = 0.9 mg is the approximate dosage"

techsmart Over a year ago

Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.

Arkistarvh Kltzuonstev Over a year ago

@techsmart Do you need the output ['9mg', ' 9.0mg', '  0 mg', ' .009 mg'] in a variable? Then you can just do z = p.findall(x). Also, if you use re.search, you need to call .group(0) on the returned match object. It would be better to use findall or finditer.
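As a sketch of the finditer route, the variant below uses named groups so that the number and the unit come back separately and without the leading whitespace seen in the output above. The group names and the slightly stricter number pattern are assumptions made for illustration, not part of the original answer:

```python
import re

# number: digits with an optional decimal part, or a leading-dot decimal.
pattern = re.compile(
    r"(?P<number>\d+(?:\.\d+)?|\.\d+)\s*"
    r"(?P<unit>mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)"
)
text = "9mg 9.0mg  0 mg .009 mg is the approximate dosage"

# Each match object gives access to both named groups.
pairs = [(m.group("number"), m.group("unit")) for m in pattern.finditer(text)]
print(pairs)
# [('9', 'mg'), ('9.0', 'mg'), ('0', 'mg'), ('.009', 'mg')]
```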

techsmart Over a year ago

No, actually I don't need all matches. I have updated the code above and am now getting "TypeError: expected string or bytes-like object" on the line v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE). Any idea what causes that?

Benison Sam · Accepted Answer · 2020-09-01 18:29:54Z

This answer is not just a specific solution to the question; it attempts a general solution for matching numbers with or without a decimal part, with or without units, and with or without thousand separators, in both European (.,) and International (,.) formats.

(((\d{4,}(?:[\.,]\d+)?)|((\d{1,3}(?:((\.)|,)\d{1,3})?(?:\6\d{1,3})*(?:(?(7),|\.)\d+)?)))\s*(?:[a-zA-Z]*))

\d{1,3} - selects one to three digits.
\d{4,} - selects four or more digits (a number written without thousand separators).
((\.)|,) - selects . (European) or , (International) as the thousand separator.
\6 - selects only the previously matched thousand separator.
(?(7),|\.) - if-else conditional that matches the decimal separator: , (European) if the thousand separator was ., otherwise . (International).
\s*(?:[a-zA-Z]*) - selects an optional unit after the matched number, with or without preceding whitespace.


import re

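# Raw string is required: in a normal string literal '\6' is an octal
# escape (the character \x06), not a backreference to group 6.
p = re.compile(r'(((\d{4,}(?:[\.,]\d+)?)|((\d{1,3}(?:((\.)|,)\d{1,3})?(?:\6\d{1,3})*(?:(?(7),|\.)\d+)?)))\s*(?:[a-zA-Z]*))')
x = 'Examples: 0.9 mg, 9 mg, 9mg, 0.9mg'

print([item[0] for item in p.findall(x)])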

Output: ['0.9 mg', '9 mg', '9mg', '0.9mg']
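The example above does not exercise the thousand-separator handling. The sketch below writes the same pattern in re.VERBOSE form, so the group numbers used by \6 and (?(7)...) are easier to follow, and runs it on two made-up values, one in each format. The sample text and the name NUMBER_WITH_UNIT are hypothetical:

```python
import re

NUMBER_WITH_UNIT = re.compile(r"""
    (                              # group 1: number plus optional unit
      (                            # group 2: either number form
        (\d{4,}(?:[.,]\d+)?)       # group 3: 4+ digits, no thousand separator
        |
        ((\d{1,3}                  # groups 4 and 5: grouped form
            (?:((\.)|,)\d{1,3})?   # group 6: separator; group 7 set only for a dot
            (?:\6\d{1,3})*         # further groups must reuse the same separator
            (?:(?(7),|\.)\d+)?     # decimal mark is the other symbol
        ))
      )
      \s*(?:[a-zA-Z]*)             # optional unit, optional whitespace before it
    )
""", re.VERBOSE)

# Hypothetical sample: European format first, International second.
text = "Shipments: 1.234.567,89 kg and 1,234,567.89 kg"

# findall returns one tuple per match; index 0 is group 1 (the whole match).
matches = [groups[0] for groups in NUMBER_WITH_UNIT.findall(text)]
print(matches)  # ['1.234.567,89 kg', '1,234,567.89 kg']
```

Because the unit part is optional and accepts any letters, this pattern matches bare numbers and arbitrary trailing words as well, so it is better suited to locating numbers than to validating units.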

