Python finding regex in a String

Question

I'm trying to find all occurrences of money values in a string called webpage.

The string webpage holds the text from this webpage. In my program it is simply hardcoded, because that is all that is needed, so I won't paste it all here.

regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)

It's returning [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn].

So, you should NOT parse a web page with regex. There are proper tools for that.
– RomanPerekhrest, Commented Dec 14, 2017 at 14:33
My answer here (stackoverflow.com/a/37571199/2064981) might help you ;)
– SamWhan, Commented Dec 14, 2017 at 14:33
There is no way this regex will match anything as it matches only at the beginning of the string, because of '^stuff'. So it looks like you don't want to match at the very beginning of the webpage.
– ForceBru, Commented Dec 14, 2017 at 14:48
Your regex starts with the ^ anchor, which means it's only going to match a currency value that starts at the very beginning of the document.
– glibdud, Commented Dec 14, 2017 at 14:48

JCJ · Accepted Answer · 2017-12-14 15:04:40Z

2

Without knowing the text it has to search, you could use the regex:

([€|$|£]+[0-9a-zA-Z\,\.]+)

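to capture every token that starts with €, £ or $ and continues with digits, letters, commas or periods. That gives you the amount plus any unit attached directly to it (such as bn), but not words that follow after a space. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.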

Using this regex we get this code:

import re
webpage = '''
one 
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€|$]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)

with the output:

['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']

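EDIT: Using the same regex as in the code above on the provided website returns the output below. The character class in that code omits £, so pound amounts such as £100bn are not matched: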

['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
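
Two details of the regex above are worth checking: inside a character class, | is a literal pipe rather than "or", and the second class accepts trailing punctuation, which is why '$17.4bn.' keeps its full stop. The following is a minimal sketch on a made-up sample string; the tightened pattern is an assumption about what is wanted, not part of the original answer.

```python
import re

# Hypothetical sample text, not taken from the page in the question.
text = "A|B cost $17.4bn. Rivals bid £100bn, then $131bn"

# Regex from the answer: "|" inside [...] is a literal pipe, not alternation,
# and "," / "." in the second class pull in trailing punctuation.
original = r'([€|$|£]+[0-9a-zA-Z\,\.]+)'
original_matches = re.findall(original, text)
print(original_matches)  # ['|B', '$17.4bn.', '£100bn,', '$131bn']

# Assumed tighter variant: one currency symbol, digits with optional
# thousands groups and decimals, then an optional run of letters ("bn").
tightened = r'[€$£]\d+(?:,\d{3})*(?:\.\d+)?[a-zA-Z]*'
tightened_matches = re.findall(tightened, text)
print(tightened_matches)  # ['$17.4bn', '£100bn', '$131bn']
```

The tightened pattern only allows a period or comma when digits follow it, so sentence punctuation is left out of the match.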

If you also want to find values without a currency symbol, e.g. 500million, you can add 0-9 to the first character class. The match may then start with either a currency symbol or a digit.

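For example:

import re
webpage = '''
one 
million
€1293.1205 million
500million
'''
# 0-9 in the first class lets a match start with a digit as well as a symbol
regex = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)

gives the output:

['€1293.1205', '500million']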
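
One side effect of adding 0-9 to the first character class is that any run of two or more digits now matches, including years. The sketch below shows this on a made-up sentence and contrasts it with a stricter pattern that only accepts a bare number when a unit follows. The names UNIT, WORD and NUMBER and the list of units are illustrative assumptions, not part of the answer.

```python
import re

# Hypothetical sample text.
text = "In 2017 it paid €1293.1205 million, then 500million and 500mil dollars."

# Regex from the answer with 0-9 added to the first class.
loose = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
loose_matches = re.findall(loose, text)
print(loose_matches)  # ['2017', '€1293.1205', '500million', '500mil']

# Assumed building blocks for a stricter pattern.
NUMBER = r'\d+(?:,\d{3})*(?:\.\d+)?'
UNIT = r'(?:bn|mil(?:lion)?)'
WORD = r'(?:dollars?|euros?|pounds?)'

strict = (
    r'[€$£]' + NUMBER + r'(?:\s?' + UNIT + r')?'        # symbol, optional unit
    + r'|' + NUMBER + r'\s?' + UNIT + r'(?:\s' + WORD + r')?'  # bare number needs a unit
)
strict_matches = re.findall(strict, text)
print(strict_matches)  # ['€1293.1205 million', '500million', '500mil dollars']
```

Because the stricter pattern uses only non-capturing groups, findall returns the whole match, and "500mil dollars" from the comment below is picked up as one value.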

edited Dec 14, 2017 at 15:04

answered Dec 14, 2017 at 14:48

JCJ


2 Comments

Chaz Over a year ago

This works. If I wanted to be able to find something such as 500mil dollars, how would I adapt your regex?

JCJ Over a year ago

I have updated my answer with a potential solution to that.

eLRuLL · Accepted Answer · 2017-12-14 14:49:23Z

0

The first error in your regex is the ^ at the beginning, which anchors the match to the start of the string, so a value can only be found if it begins at the very first character. That isn't helpful when using findall.
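
A minimal sketch of the effect of the anchor, using a deliberately simplified pattern and a made-up sentence rather than the full regex and page from the question:

```python
import re

# Hypothetical sample text; the money values are not at the start.
text = "Shares rose after the $131bn deal, beating a £100bn rival bid."

anchored = r'^[$£€]\d+bn'    # simplified pattern with the leading ^
unanchored = r'[$£€]\d+bn'   # same pattern without the anchor

anchored_matches = re.findall(anchored, text)
unanchored_matches = re.findall(unanchored, text)
at_start_matches = re.findall(anchored, "$131bn deal")

print(anchored_matches)    # []
print(unanchored_matches)  # ['$131bn', '£100bn']
print(at_start_matches)    # ['$131bn']
```

The anchored pattern only succeeds when the string itself begins with the value, which is never the case for the text of a whole page.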

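You are also defining a lot of capturing groups (the parentheses) that I assume you don't really need. When a pattern contains groups, findall returns the groups instead of the whole match. Make them all non-capturing (by adding ?: right after the opening parenthesis) and you are going to get very close to what you want: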

regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
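
To illustrate the difference, here is a short sketch on a made-up sentence. The first two patterns are simplified stand-ins; answer_regex is the regex above, and relaxed is an assumed tweak (decimals of any length) that is not part of this answer.

```python
import re

# Hypothetical sample text.
text = "a $131bn deal, a £100bn bid and $17.4bn in cash"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(r'[$£€](\d+)(bn)?', text)
print(with_groups)  # [('131', 'bn'), ('100', 'bn'), ('17', '')]

# With non-capturing groups, findall returns the whole match.
without_groups = re.findall(r'[$£€](?:\d+)(?:bn)?', text)
print(without_groups)  # ['$131bn', '£100bn', '$17']

# The regex from this answer: the decimal part needs exactly two digits,
# so "$17.4bn" is cut off at "$17".
answer_regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
answer_matches = re.findall(answer_regex, text)
print(answer_matches)  # ['$131bn', '£100bn', '$17']

# Assumed tweak: allow one or more decimal digits.
relaxed = answer_regex.replace(r'(?:\.[0-9][0-9])?', r'(?:\.[0-9]+)?')
relaxed_matches = re.findall(relaxed, text)
print(relaxed_matches)  # ['$131bn', '£100bn', '$17.4bn']
```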

answered Dec 14, 2017 at 14:49

eLRuLL


Ajax1234 · Accepted Answer · 2017-12-14 14:59:56Z

0

A web-scraping solution (Python 2, using BeautifulSoup):

import urllib
import itertools
from bs4 import BeautifulSoup as soup
import re
s = soup(str(urllib.urlopen('http://www.bbc.com/news/business-41779341').read()), 'lxml')
final_data = list(itertools.chain.from_iterable(filter(lambda x:x, [re.findall('[€\$£][\w\.]+', i.text) for i in s.findAll('p')])))

Output:

[u'$131bn', u'\xa3100bn', u'$100bn', u'$17.4bn.']
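
The same idea (search only the text inside p tags) can be sketched in Python 3 without a network call. This version swaps BeautifulSoup for the standard library's html.parser and runs on a made-up HTML snippet; the class name ParagraphText and the sample markup are assumptions for illustration only.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0  # greater than 0 while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.paragraphs[-1] += data


# Hypothetical markup standing in for the downloaded page.
sample_html = (
    "<h1>Deal worth $999bn</h1>"
    "<p>A $131bn bid beat a £100bn offer.</p>"
    "<p>No money here.</p>"
    "<p>Debt: $17.4bn.</p>"
)

parser = ParagraphText()
parser.feed(sample_html)

# Same pattern as above, applied paragraph by paragraph.
money = [m for p in parser.paragraphs for m in re.findall(r'[€$£][\w.]+', p)]
print(money)  # ['$131bn', '£100bn', '$17.4bn.']
```

The heading is skipped because only p elements are searched, and the trailing period on the last value is kept because the pattern allows "." anywhere after the symbol.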

answered Dec 14, 2017 at 14:59

Ajax1234

