Regex for this string? Python

Question

For this string:

London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9

I'd like to find all those numbers. (And only numbers)

I'm trying these ones:

re.findall('\s(.*),', string)
re.findall(' (.*),', string)
re.findall('\s++.+,', string)
re.findall('\s{2}.{1},', string)

But nothing seems to work.

I suppose you know \s is a single whitespace character. There is similarly \d for digits. Have you tried to research further?
– Austin, Commented Apr 28, 2020 at 2:47

jameshollisandrew · Accepted Answer · 2020-04-28 07:19:03Z

Let’s review your four initial patterns and cover their syntax, then we can consider a few expressions that match the strings you’re looking for (i.e. numbers of the form 00.0).

Reviewing Patterns

re.findall('\s(.*),', string)

This pattern reads: find a single whitespace character (\s), then 0 or more repetitions of any character except a newline (.*), then a comma (,). The parentheses form a capturing group, so findall returns only the text matched by ‘.*’.

This pattern matches from the first whitespace character to the last comma rather than each number separately, because repetition qualifiers are greedy: the qualifiers + * ? consume as many characters as they can while still letting the rest of the pattern match. When we use ‘.*’ in an expression, it runs to the end of the line and then backs up only as far as the final comma, so findall returns one long string beginning with 48.0 and ending with 59.0 instead of twelve numbers.

re.findall(' (.*),', string)

Same problem as the previous pattern: a literal space replaces \s, but ‘.*’ is still greedy.
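To make the greedy behaviour concrete, here is a small sketch on a shortened version of the string (the shortened string and the lazy variant ‘.*?’ are additions for illustration, not part of the original patterns):

```python
import re

# Shortened, hypothetical version of the question's string.
s = "London:Jan 48.0,Feb 38.9,Mar 39.9"

# Greedy: .* runs to the end of the line, then backs up to the LAST comma.
greedy = re.findall(r"\s(.*),", s)
print(greedy)  # ['48.0,Feb 38.9']

# Lazy: .*? stops at the FIRST comma it can reach.
lazy = re.findall(r"\s(.*?),", s)
print(lazy)  # ['48.0', '38.9']
```

Note that even the lazy version misses the final value, because nothing after 39.9 supplies the comma the pattern requires. That is one reason to match the numbers themselves rather than the text around them.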

re.findall('\s++.+,', string)

Before Python 3.11, re does not accept a repetition qualifier applied directly to another repetition qualifier, so ‘\s++’ raises a "multiple repeat" error. (From Python 3.11 on, ‘++’ is parsed as a possessive quantifier, so the pattern compiles but still does not isolate the numbers.) If the goal were to match a literal plus sign, the first ‘+’ would need to be preceded by a ‘\’ like this: ‘\++’. That expression reads: match one or more ‘+’ characters. Either way, the expression part ‘.+’ matches one or more repetitions of any character that isn’t a newline and falls prey to the same greedy problem.
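A quick way to check both points on your own interpreter is sketched below. The test string "a++b+c" is made up for illustration, and the try/except is there because the outcome for the unescaped pattern depends on the Python version:

```python
import re

# Escaped: the first plus is a literal character, the second is the quantifier.
literal_plus_runs = re.findall(r"\++", "a++b+c")
print(literal_plus_runs)  # ['++', '+']

# Unescaped: whether this compiles depends on the interpreter version,
# so catch the error instead of assuming either outcome.
try:
    re.compile(r"\s++.+,")
    stacked_ok = True
except re.error:
    stacked_ok = False
print(stacked_ok)
```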

re.findall('\s{2}.{1},', string)

Curly brackets are repetition qualifiers that allow a range of repetitions to be specified. They follow the syntax ‘{m,n}’ (with no space after the comma), where m is the minimum number of repetitions and n is the maximum. For example, the pattern AB{3,4} will not match ABB but it will match ABBB or ABBBB.

The pattern above looks to match: 2 repetitions of any whitespace character (‘\s{2}’) followed by any one character that is not a newline (‘.{1}’) followed by a comma. Your string never contains two whitespace characters in a row, so it finds nothing.
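The brace syntax can be checked with a few throwaway strings. This is a minimal sketch; the test strings and the shortened input are chosen here for illustration:

```python
import re

# B must repeat 3 or 4 times after a single A.
pattern = re.compile(r"AB{3,4}")
results = {text: bool(pattern.fullmatch(text)) for text in ("ABB", "ABBB", "ABBBB")}
print(results)  # {'ABB': False, 'ABBB': True, 'ABBBB': True}

# With a space inside the braces, Python treats them as literal text,
# not as a quantifier, so the same input no longer matches.
spaced = bool(re.fullmatch(r"AB{3, 4}", "ABBB"))
print(spaced)  # False

# The fourth pattern from the question: no double whitespace, so no matches.
s = "London:Jan 48.0,Feb 38.9,Mar 39.9"
fourth = re.findall(r"\s{2}.{1},", s)
print(fourth)  # []
```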

Here are a couple different patterns to try out - I’ll touch on the syntax as well.

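import re

# Raw string so the backslash reaches the regex engine unchanged.
p = r'[0-9][0-9]\.[0-9]'
s = 'London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9'

m = []  # default, so print(m) still works when nothing matches
if re.search(p, s):
    m = re.findall(p, s)

print(m)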

Note: unless you know for certain that each input string contains the pattern you want to match, it’s helpful to test the string before using the result. One way to test it is with an if clause on re.search(p, s), where p is a variable for some pattern and s is a variable for some string. The check matters most for re.search and re.match, which return None when nothing matches; re.findall simply returns an empty list in that case.
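The difference between the two return values can be seen on an input with no numbers at all. The input string below is hypothetical:

```python
import re

p = r"[0-9][0-9]\.[0-9]"
no_numbers = "London:Jan,Feb,Mar"  # hypothetical input with nothing to find

first = re.search(p, no_numbers)
everything = re.findall(p, no_numbers)
print(first)       # None
print(everything)  # []

# search returns None on failure, so guard before calling methods on the result.
if first is not None:
    print(first.group())

# findall's empty list is falsy, so a plain truth test is enough.
if not everything:
    print("no numbers found")
```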

p = r'[0-9][0-9]\.[0-9]'

This pattern will match: one digit 0-9 (‘[0-9]’) followed by one digit 0-9 (‘[0-9]’) followed by a single period (‘\.’) followed by one digit 0-9 (‘[0-9]’). For example, this pattern will match the string 19.9 or 40.0 but not 40. or 40 (nothing after the period, or no period at all).

The string ‘[0-9]’ uses square brackets to define a set in regex. With a set, any one of the characters included in the brackets can be matched for that one spot. For example, [A5] will match A or 5 but not A5. Just like literal characters, a set can take a repetition qualifier, so we can use [A5]{1,2} to also match A5 (along with AA, 55 and 5A).

Note: The reason this expression registers the period as a period is that it is preceded by a backslash (i.e. it is escaped from its special meaning), so it no longer matches ‘any character that is not a newline character.’
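A short sketch of how a set behaves with and without a repetition qualifier; the candidate strings are made up for illustration, and re.fullmatch is used so that the whole string has to fit the pattern:

```python
import re

# One position, one character from the set.
set_once = [t for t in ("A", "5", "A5", "B") if re.fullmatch(r"[A5]", t)]
print(set_once)  # ['A', '5']

# One or two positions, each filled from the same set.
candidates = ("A", "5", "A5", "5A", "AA", "B5")
set_twice = [t for t in candidates if re.fullmatch(r"[A5]{1,2}", t)]
print(set_twice)  # ['A', '5', 'A5', '5A', 'AA']

# The number pattern: two digits, a literal period, one digit.
numbers = [t for t in ("19.9", "40.0", "40.", "40")
           if re.fullmatch(r"[0-9][0-9]\.[0-9]", t)]
print(numbers)  # ['19.9', '40.0']
```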

p = r'[0-9]{2}\.[0-9]{1}'

This pattern does the same thing as above but uses the curly brackets to set a constant for the number of repetitions (rather than repeat the set twice like the previous pattern).

p = r'\d{2}\.\d{1}'

This pattern uses the special sequence \d to match any decimal digit. For ASCII text it is equivalent to the set [0-9] used above; with Python 3 str patterns, \d also matches digits from other scripts unless the re.ASCII flag is set.
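As a sanity check, the three spellings can be run against the question's string and compared. The Arabic-Indic digit at the end is an extra illustration of where \d and [0-9] can differ; it is not something the question's data contains:

```python
import re

s = ("London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,"
     "Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9")

patterns = [r"[0-9][0-9]\.[0-9]", r"[0-9]{2}\.[0-9]{1}", r"\d{2}\.\d{1}"]
results = [re.findall(p, s) for p in patterns]

print(results[0])
print(results[0] == results[1] == results[2])  # True

# \d is wider than [0-9] on str patterns unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_hit = re.fullmatch(r"\d", arabic_three) is not None
ascii_hit = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None
print(unicode_hit, ascii_hit)  # True False
```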

It’s worth noting that, for this particular string, the pattern would still find the numbers if the period were left unescaped, since the period character ‘.’ is itself included in the class of ‘any character that isn’t a newline.’ However, that makes the pattern less robust, since it will (inaccurately) match any non-newline character in that spot. For example, it will match 29.9 or 29A9 or 2909, as they all have a non-newline character in the 3rd position.
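The three example strings can be used to see the difference directly. This is only a sketch; re.fullmatch is used so each candidate must fit the pattern from start to end:

```python
import re

candidates = ["29.9", "29A9", "2909"]

# Unescaped period: any non-newline character is accepted in the 3rd position.
unescaped = [c for c in candidates if re.fullmatch(r"\d{2}.\d", c)]
print(unescaped)  # ['29.9', '29A9', '2909']

# Escaped period: only a literal "." is accepted there.
escaped = [c for c in candidates if re.fullmatch(r"\d{2}\.\d", c)]
print(escaped)  # ['29.9']
```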

Hope this helps!

Thanks a lot for elaborating, it was really, really helpful! :-)

duckboycool · Accepted Answer · 2020-04-28 02:45:48Z

1

re.findall(r'\d+[.]?\d*', string) gets me ['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']
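One thing to keep in mind with this pattern, shown in a hedged sketch on a made-up sentence: because the period is optional and the trailing digits may be empty, a number that ends a sentence keeps its full stop.

```python
import re

# Hypothetical text where a number is followed by sentence punctuation.
text = "Room 5. Temp 48.0 and 52"

found = re.findall(r"\d+[.]?\d*", text)
print(found)  # ['5.', '48.0', '52']
```

On the question's string this never comes up, since every period there sits between two digits.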


2 Comments

Georgia Fernández Over a year ago

It works perfectly, thanks! Could you elaborate on how it works, though?

duckboycool Over a year ago

\d just means any digit, so it looks for one or more digits, then an optional period, then 0 or more digits after that. Also, I don't think it mattered here, but one issue is that you weren't using a raw string for the regex, so if you happened to have a letter after the backslash that Python itself interprets as an escape sequence (\b, for example), that wouldn't work. That's why I have the r before the string.
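The raw-string point is easiest to see with \b, which Python reads as a backspace character in a normal string but which the regex engine reads as a word boundary. The sample sentence is made up for illustration:

```python
import re

# In a normal string "\b" is ONE character (backspace); raw keeps backslash + b.
print(len("\b"), len(r"\b"))  # 1 2

plain = re.findall("\bcat\b", "a cat sat")   # looks for backspace, c, a, t, backspace
raw = re.findall(r"\bcat\b", "a cat sat")    # looks for "cat" between word boundaries
print(plain)  # []
print(raw)    # ['cat']
```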

Samwise · Accepted Answer · 2020-04-28 02:48:43Z

1

Since you don't have any stray periods, any substring that's a combination of digits and decimal points will do:

>>> re.findall(r'[\d.]+', nums)
['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']

If by "all those numbers" you meant just the integers (i.e. the periods are separators rather than decimal points), it's easier:

>>> re.findall(r'\d+', nums)
['48', '0', '38', '9', '39', '9', '42', '2', '47', '3', '52', '1', '59', '5', '57', '2', '55', '4', '62', '0', '59', '0', '52', '9']


1 Comment

Samwise Over a year ago

Square brackets define a "character class". [\d.] means "anything that's either a digit (\d) or a literal period (.)". The + after it means "at least one of those but maybe more".
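The "no stray periods" condition matters. A rough sketch on a made-up string shows what the character class picks up when periods appear outside of numbers:

```python
import re

# Hypothetical text with a sentence-ending period, an ellipsis and a version number.
messy = "Jan 48.0. Feb 38.9 ... v1.2.3"

found = re.findall(r"[\d.]+", messy)
print(found)  # ['48.0.', '38.9', '...', '1.2.3']
```

Any run of digits and periods counts as a match, so this pattern is a good fit only when the input is as clean as the string in the question.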

Toto · Accepted Answer · 2020-04-28 10:46:22Z

1

Use:

import re

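# Named text rather than str so the built-in str type is not shadowed.
text = 'London:Jan 48.0,Feb 38.9,Mar 39.9,Apr 42.2,May 47.3,Jun 52.1,Jul 59.5,Aug 57.2,Sep 55.4,Oct 62.0,Nov 59.0,Dec 52.9'
num_arr = re.findall(r'\d+(?:\.\d+)?', text)
print(num_arr)  # print is a function in Python 3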

Output:

['48.0', '38.9', '39.9', '42.2', '47.3', '52.1', '59.5', '57.2', '55.4', '62.0', '59.0', '52.9']
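The ‘(?:...)’ in this pattern is a non-capturing group, and with re.findall that choice changes the output. The sketch below uses a shortened, hypothetical input in which one value has no decimal part, and adds a float conversion as one possible next step:

```python
import re

sample = "Jan 48.0,Feb 38"  # hypothetical: the second value has no decimal part

# With a capturing group, findall returns the group's text, not the whole match.
capturing = re.findall(r"\d+(\.\d+)?", sample)
print(capturing)  # ['.0', '']

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"\d+(?:\.\d+)?", sample)
print(non_capturing)  # ['48.0', '38']

# The matches are strings; convert them if numeric values are needed.
temps = [float(n) for n in non_capturing]
print(temps)  # [48.0, 38.0]
```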
