6

I want to build a regex that captures every place in a string where an integer or a floating-point number appears before a unit of measurement (ml, mg, kg, etc.). My current regex only handles integers and breaks when there's a space between the number and the unit. I want to handle both cases in my code.

import re

p = re.compile('[0-9](?:mg|kg|ml|q.s.|ui|M|g|µg)')
x = '0.9mg is the approximate dosage'
z = p.findall(x)
print(z)

This doesn't work for decimals and also breaks when there's a space between the number and the unit.

Expected patterns to be captured are:

Examples: 0.9 mg, 9 mg, 9mg, 0.9mg

Any help would be appreciated.

Using the regex in the code:

mg = []
newregex = r"[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)" 
for s in zz:
    for e in extracteddata:
        v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE)
        if v:
            mg.append(v.group(0))

4 Answers

6

You can try with this:

([.\d]+)\s*(?:mg|kg|ml|q.s.|ui|M|g|µg)
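One thing to be aware of, sketched here on the assumption that the pattern above is used unchanged: because the number sits in a capturing group, findall returns only the numeric part, while finditer with .group(0) returns the number together with its unit. The sample string is the one from the question.

```python
import re

# Pattern as posted above. The dots in q.s. are unescaped,
# so they match any character, not only a literal dot.
pattern = re.compile(r'([.\d]+)\s*(?:mg|kg|ml|q.s.|ui|M|g|µg)')
text = 'Examples: 0.9 mg, 9 mg, 9mg, 0.9mg'

# findall returns the contents of group 1 (the number) for each match.
numbers_only = pattern.findall(text)

# finditer gives match objects; group(0) is the whole match, unit included.
full_matches = [m.group(0) for m in pattern.finditer(text)]

print(numbers_only)  # ['0.9', '9', '9', '0.9']
print(full_matches)  # ['0.9 mg', '9 mg', '9mg', '0.9mg']
```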


5 Comments

Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.
The actual number extracted from the string can be acquired with .group(1). So try mg.append(re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE).group(1)). Also, it would be a good idea to move the pattern compilation outside the loop so it doesn't have to be compiled every single time.
I tried it but it's not working. I have updated the code with some changes and am now getting "TypeError: expected string or bytes-like object" on the line v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE). Any idea what causes that?
There is no .group(0) but a .group(1)
Are you talking about the previous code I posted, where mg.append(re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE)) was used?
3

(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)

This regex will select properly formatted numbers with properly formatted units after them, and will reject badly formed numbers or non-existent units.

  • (?<!\d|\.) - make sure there are no digits or decimal points immediately before this number.
  • \d+ - get one or more digits.
  • (?:\.\d+)? - optionally get a decimal point, followed by one or more digits.
  • \s*? - get zero to unlimited whitespace characters, as few as possible.
  • (?:mg|kg|ml|q\.s\.|ui|M|g|µg) - match one of the listed units (the group is non-capturing).
  • (?!\w) - make sure there's no extra data following the captured unit.

import re

# Raw string so that \d, \. and \s reach the regex engine unchanged.
p = re.compile(r'(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)')
x = 'Examples: 0.9 mg, 9 mg, 9mg, 0.9mg'

print(p.findall(x))

['0.9 mg', '9 mg', '9mg', '0.9mg']
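To illustrate the "reject badly formed numbers or non-existent units" claim, here is a small sketch using the same pattern. The sample strings are made up for illustration and are not from the question.

```python
import re

p = re.compile(r'(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)')

# Hypothetical inputs: one valid value, one malformed number,
# one unit followed by extra letters, one unknown unit.
samples = ['12kg', '1.2.3mg', '5 mgs', '7 lb']
results = {s: p.findall(s) for s in samples}

for sample, found in results.items():
    print(repr(sample), '->', found)
# '12kg' -> ['12kg']
# '1.2.3mg' -> []
# '5 mgs' -> []
# '7 lb' -> []
```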

2 Comments

Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.
@techsmart That error is saying you gotta put the re.IGNORECASE|re.MULTILINE flags into your re.compile statement, not your re.search statement.
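A minimal sketch of that point, assuming the pattern from this answer and a made-up sample string: the flags are passed to re.compile, and the compiled pattern's own search method is then called without flags. Passing flags together with an already compiled pattern reproduces the ValueError.

```python
import re

# Flags belong in re.compile, not in the later search call.
pattern = re.compile(
    r'(?<!\d|\.)\d+(?:\.\d+)?\s*?(?:mg|kg|ml|q\.s\.|ui|M|g|µg)(?!\w)',
    flags=re.IGNORECASE,
)

text = 'Take 0.9 MG twice a day'  # hypothetical sample input

match = pattern.search(text)
found = match.group(0) if match else None
print(found)  # 0.9 MG

# Passing flags alongside a compiled pattern raises the error from the comment.
try:
    re.search(pattern, text, flags=re.IGNORECASE)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Note that with re.IGNORECASE the unit M also matches a lowercase m, which may or may not be what you want.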
2

Try this:

x = '9mg 9.0mg  0 mg .009 mg is the approximate dosage'
p = re.compile(r'[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)')
p.findall(x)

Output:

['9mg', ' 9.0mg', '  0 mg', ' .009 mg']

4 Comments

It's not working if there's a space between 0.9 and mg in the string. Example: "x = 0.9 mg is the approximate dosage"
Thanks. I am trying to save the matched patterns in a list but I'm getting the error "ValueError: cannot process flags argument with a compiled pattern". It would be great if you could help me solve it. I've added the code to the question for reference.
@techsmart Do you need the output ['9mg', ' 9.0mg', ' 0 mg', ' .009 mg'] in a variable? Then you can just do z = p.findall(x). Also, if you use re.search, you need to call .group(0) on the returned match object. It would be better to use findall or finditer.
No, actually I don't need all matches. I have updated the code above and am now getting "TypeError: expected string or bytes-like object" on the line v = re.search(newregex,extracteddata,flags=re.IGNORECASE|re.MULTILINE). Any idea what causes that?
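A likely cause of that TypeError, sketched under the assumption that extracteddata is a list of strings (the question does not show how it is built): re.search is being handed the whole list rather than the loop variable e. The sample data below is hypothetical.

```python
import re

newregex = r"[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)"  # pattern as posted in the question

# Hypothetical stand-in for the question's extracteddata.
extracteddata = ['0.9 mg is the approximate dosage', 'no dosage here', 'take 9mg']

# Passing the list itself reproduces the TypeError.
try:
    re.search(newregex, extracteddata, flags=re.IGNORECASE)
    raised = False
except TypeError:
    raised = True

# Searching each string in turn works; strip() drops the leading
# whitespace that the [0-9\.\s]+ character class can pick up.
mg = []
for e in extracteddata:
    v = re.search(newregex, e, flags=re.IGNORECASE)
    if v:
        mg.append(v.group(0).strip())

print(mg)  # ['0.9 mg', '9mg']
```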
0

This answer is not just a specific solution to the question but an attempt at a general solution for matching numbers, with or without a decimal part, with or without units, and with or without thousand separators, in both European (. for thousands, , for decimals) and International (, for thousands, . for decimals) formats.

(((\d{4,}(?:[\.,]\d+)?)|((\d{1,3}(?:((\.)|,)\d{1,3})?(?:\6\d{1,3})*(?:(?(7),|\.)\d+)?)))\s*(?:[a-zA-Z]*))

  • \d{1,3} - selects one to three digits.
  • \d{4,} - selects four or more digits (a number written without thousand separators).
  • ((\.)|,) - select . (European) or , (International) for Thousand separator.
  • \6 - select only previously matched Thousand separator.
  • (?(7),|\.) - if-else conditional which matches precision separator , (European) or . (International) based on the previously matched Thousand separator.
  • \s*(?:[a-zA-Z]*) - selects an optional unit after the matched number, with or without preceding whitespace.
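The separator handling described above can be checked with a short sketch. The sample values are made up for illustration, and the pattern is written as a raw string so that \6 is read as a backreference rather than as a Python octal escape.

```python
import re

general = re.compile(
    r'(((\d{4,}(?:[\.,]\d+)?)|((\d{1,3}(?:((\.)|,)\d{1,3})?'
    r'(?:\6\d{1,3})*(?:(?(7),|\.)\d+)?)))\s*(?:[a-zA-Z]*))'
)

# Hypothetical inputs: International, European, repeated separators, no separators.
samples = ['1,234.56 kg', '1.234,56 kg', '1,234,567.89 ml', '12345.6 g']

# match() anchors at the start of each string; group(0) is the whole match.
matched = [general.match(s).group(0) for s in samples]

for sample, result in zip(samples, matched):
    print(repr(sample), '->', repr(result))
```

Each sample should be matched in full, number and unit together.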

import re

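# Raw string: without it, Python reads \6 as an octal escape instead of a backreference.
p = re.compile(r'(((\d{4,}(?:[\.,]\d+)?)|((\d{1,3}(?:((\.)|,)\d{1,3})?(?:\6\d{1,3})*(?:(?(7),|\.)\d+)?)))\s*(?:[a-zA-Z]*))')
x = 'Examples: 0.9 mg, 9 mg, 9mg, 0.9mg'

# findall returns one tuple per match; item[0] is the outermost group (the full match).
print([item[0] for item in p.findall(x)])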

Output: ['0.9 mg', '9 mg', '9mg', '0.9mg']
