For this string:

London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9

I'd like to find all those numbers. (And only numbers)

I'm trying these ones:

re.findall('\s(.*),', string)
re.findall(' (.*),', string)
re.findall('\s++.+,', string)
re.findall('\s{2}.{1},', string)

But nothing seems to work.

  • I suppose you know \s matches a single whitespace character. Similarly, \d matches a digit. Have you tried researching further? Commented Apr 28, 2020 at 2:47

4 Answers

1

Let’s review your four initial patterns and cover their syntax, then consider a few expressions that match the values you’re looking for (i.e. numbers of the form 00.0).

Reviewing Patterns

re.findall('\s(.*),', string)

This pattern reads: find a single whitespace character (\s), followed by 0 or more repetitions of any character except a newline (.*), followed by a comma (,). Because of the parentheses, findall returns only the part matched by (.*).

This pattern will most likely match nearly the entire string, since repetition qualifiers are greedy (i.e. any of the expression characters + * ? will keep matching as long as the preceding expression still matches). When we use ‘.*’ in an expression, it will almost always capture far more than intended, because it greedily matches all characters that aren’t a newline and only backs off as far as the last comma. Here that means a single match running from the first space to the last comma.
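To see the greedy behaviour concretely, here is a minimal sketch on a shortened copy of your string (the shortened sample and the variable names are mine, not from the question). The lazy variant ‘.*?’ is included only for contrast; note that it still misses the last value, because no comma follows it.

```python
import re

# Shortened sample so the output stays readable (an assumption for illustration).
s = 'London:Jan 48.0,Feb 38.9,Mar 39.9'

# Greedy: .* runs to the end of the string, then backs off to the LAST comma.
greedy = re.findall(r'\s(.*),', s)

# Lazy: .*? stops at the FIRST comma it can reach.
lazy = re.findall(r'\s(.*?),', s)

print(greedy)  # ['48.0,Feb 38.9']
print(lazy)    # ['48.0', '38.9']
```

So even the lazy version is not a good fit here; matching the digits directly is simpler.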

re.findall(' (.*),', string)

Same problem as the previous pattern; the only difference is that a literal space replaces \s.

re.findall('\s++.+,', string)

Before Python 3.11, re does not accept a repetition qualifier applied directly to another repetition qualifier, so ‘\s++’ raises an error (“multiple repeat”). From Python 3.11 on, ‘++’ is parsed as a possessive qualifier, so the pattern compiles but still does not do what you want. If you wanted to match literal plus signs, the first ‘+’ would need to be preceded by a ‘\’ like this: ‘\++’. That expression reads: match one or more ‘+’ characters. The expression part ‘.+’ matches one or more repetitions of any character that isn’t a newline and falls prey to the greedy problem.

re.findall('\s{2}.{1},', string)

Curly braces are repetition qualifiers that let you specify a range of repetitions. They follow the syntax ‘{m,n}’ (with no space after the comma), where m is the minimum number of repetitions and n is the maximum. For example, the pattern AB{3,4} will not match ABB, but it will match ABBB or ABBBB.
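A quick way to check the range syntax is a sketch like the one below. I use re.fullmatch so the whole test string has to match; the test strings are just illustrative. It also shows that Python’s re treats the braces as literal text if you put a space after the comma.

```python
import re

# B must repeat 3 or 4 times; fullmatch requires the whole string to match.
ranged = re.compile(r'AB{3,4}')
results = {t: bool(ranged.fullmatch(t)) for t in ('ABB', 'ABBB', 'ABBBB', 'ABBBBB')}
print(results)
# {'ABB': False, 'ABBB': True, 'ABBBB': True, 'ABBBBB': False}

# With a space after the comma, the braces are no longer a qualifier.
spaced = re.compile(r'AB{3, 4}')
spaced_matches_abbb = bool(spaced.fullmatch('ABBB'))         # False
spaced_matches_literal = bool(spaced.fullmatch('AB{3, 4}'))  # True
print(spaced_matches_abbb, spaced_matches_literal)
```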

The pattern above looks to match: 2 repetitions of any whitespace character (‘\s{2}’), followed by any one character that is not a newline (‘.{1}’), followed by a comma. Your string never has two whitespace characters in a row, so it finds nothing.
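As a small sketch of what that fourth pattern actually needs, compare your string with a made-up one that does contain double spaces (the second string is hypothetical, purely for illustration):

```python
import re

pattern = r'\s{2}.{1},'

# The string from the question: only single spaces, so nothing matches.
s = 'London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9'
on_question_string = re.findall(pattern, s)

# Hypothetical input: two spaces, one character, then a comma.
on_double_spaces = re.findall(pattern, 'Jan  4,Feb  5,')

print(on_question_string)  # []
print(on_double_spaces)    # ['  4,', '  5,']
```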

Here are a couple of different patterns to try out. I’ll touch on the syntax as well.

import re

p = r'[0-9][0-9]\.[0-9]'  # raw string keeps the backslash intact
s = ' London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9'

m = []  # stays empty if the pattern never occurs
if re.search(p, s):
    m = re.findall(p, s)

print(m)

Note: unless you know for certain that each input string contains the pattern you want to match, it’s helpful to test the string before executing the match. One way to test the string is with an if clause that checks whether re.search(p, s) finds a match, where p is a variable holding some pattern and s is a variable holding some string.

p = r'[0-9][0-9]\.[0-9]'

This pattern will match: one digit 0-9 (‘[0-9]’), followed by another digit 0-9 (‘[0-9]’), followed by a single period (‘\.’), followed by one digit 0-9 (‘[0-9]’). For example, this pattern will match the string 19.9 or 40.0 but not 40. or 4.0. The string ‘[0-9]’ uses brackets to define a set in regex. With a set, any one of the characters included in the brackets can be matched in that one spot. For example, [A5] will match A or 5 but not A5. Just like literal characters, sets accept repetition qualifiers, so we can use [A5]{1,2} to also match A5.
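If you want to convince yourself how sets behave, a small sketch along these lines may help. I’m using re.fullmatch (my choice, so that the whole test string must match) and illustrative variable names.

```python
import re

single = re.compile(r'[A5]')      # exactly one character: A or 5
pair = re.compile(r'[A5]{1,2}')   # one or two characters, each A or 5

single_a = bool(single.fullmatch('A'))    # True
single_5 = bool(single.fullmatch('5'))    # True
single_a5 = bool(single.fullmatch('A5'))  # False: two characters, one slot
pair_a5 = bool(pair.fullmatch('A5'))      # True
pair_a5a = bool(pair.fullmatch('A5A'))    # False: three characters

print(single_a, single_5, single_a5, pair_a5, pair_a5a)
```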

Note: The reason this expression treats the period as a literal period is that it is preceded by a backslash (i.e. it is escaped from its special meaning), so it will no longer match ‘any character that is not a newline character.’

r'[0-9]{2}\.[0-9]{1}'

This pattern does the same thing as above but uses the curly brackets to set a constant for the number of repetitions (rather than repeat the set twice like the previous pattern).

r'\d{2}\.\d{1}'

This pattern uses the special sequence \d to match any decimal digit. For a string like this one it is equivalent to the set [0-9] used above (on str patterns \d also accepts non-ASCII Unicode digits unless the re.ASCII flag is set).

It’s worth noting that the pattern would still find these numbers if the . weren’t escaped, since the period character ‘.’ is included in the class of ‘any character that isn’t a newline.’ However, that makes the pattern less robust, since it will (inaccurately) match any character that isn’t a newline in that spot. For example, it will match 29.9 or 29A9 or 2909 (as they all have a non-newline character in the 3rd position).
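Here is a minimal sketch of that difference, run against the three example strings from the paragraph above (the variable names are mine):

```python
import re

samples = '29.9 29A9 2909'

# Escaped period: only a literal '.' is accepted in the third position.
escaped = re.findall(r'\d{2}\.\d', samples)

# Unescaped period: any non-newline character is accepted there.
unescaped = re.findall(r'\d{2}.\d', samples)

print(escaped)    # ['29.9']
print(unescaped)  # ['29.9', '29A9', '2909']
```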

Hope this helps!

2 Comments

Thanks a lot for elaborating, it was really, really helpful! :-)
Awesome, glad it helped!
1

re.findall(r'\d+[.]?\d*', string) gets me ['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']

2 Comments

It works perfectly, thanks! Could you elaborate on how it works, though?
\d just means any digit, so it looks for one or more digits, then an optional period, then 0 or more digits after that. Also, I don't think it mattered here, but one issue is that you weren't using raw strings for the regex, so if you happened to have a letter after the backslash that Python interprets as an escape sequence (such as \b), that wouldn't work. That's why I have the r before the string.
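To illustrate the raw-string point, here is a small sketch using \b, which Python turns into a backspace character in a normal string but which the regex engine reads as a word boundary in a raw string. The sample text and names are only for illustration.

```python
import re

text = 'Jan 48.0'

plain = '\bJan'   # Python converts \b to a backspace character (0x08)
raw = r'\bJan'    # backslash and b reach the regex engine untouched

plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)

print(len(plain), len(raw))  # 4 5
print(plain_result)          # []
print(raw_result)            # ['Jan']
```

With \s and \d the plain string happens to work because Python leaves unknown escapes alone, but the raw string avoids having to remember which letters are safe.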
1

Since you don't have any stray periods, any substring that's a combination of digits and decimal points will do:

>>> re.findall(r'[\d.]+', nums)
['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']

If by "all those numbers" you meant just the integers (i.e. the periods are separators rather than decimal points), it's easier:

>>> re.findall(r'\d+', nums)
['48', '0', '38', '9', '39', '9', '42', '2', '47', '3', '52', '1', '59', '5', '57', '2', '55', '4', '62', '0', '59', '0', '52', '9']

1 Comment

Square brackets define a "character class". [\d.] means "anything that's either a digit (\d) or a literal period (.)". The + after it means "at least one of those but maybe more".
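As a rough sketch of why the "no stray periods" condition matters, here is the same class applied to a made-up string that does contain stray periods (the messy string is hypothetical, not from the question), next to a stricter pattern for comparison:

```python
import re

# Hypothetical input with an ellipsis and a trailing full stop.
messy = 'Temps... Jan 48.0, Feb 38.9.'

# Any run of digits and periods, so stray periods are picked up too.
loose = re.findall(r'[\d.]+', messy)

# Digits, one period, digits: stray periods are ignored.
strict = re.findall(r'\d+\.\d+', messy)

print(loose)   # ['...', '48.0', '38.9.']
print(strict)  # ['48.0', '38.9']
```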
1

Use:

import re

text = 'London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9'  # avoid shadowing the built-in str
num_arr = re.findall(r'\d+(?:\.\d+)?', text)
print(num_arr)

Output:

['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']
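In case the (?:...) part looks odd: it is a non-capturing group, and it matters because findall returns the contents of a capturing group instead of the whole match when the pattern has one. A minimal sketch, using a shortened sample with one integer added for illustration (that sample is my assumption, not the original data):

```python
import re

sample = 'Jan 48.0,Feb 38.9,Mar 40'

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r'\d+(?:\.\d+)?', sample)

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r'\d+(\.\d+)?', sample)

print(non_capturing)  # ['48.0', '38.9', '40']
print(capturing)      # ['.0', '.9', '']
```

The optional group also lets the pattern accept plain integers such as 40.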
