1

I'm trying to find all cases of money values in a string called webpage.

The string webpage holds the text of this webpage. In my program it is simply hardcoded, because that is all that is needed, but I won't paste it all here.

regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)

It returns [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn].

6
  • 1
    What's the contents of webpage? Commented Dec 14, 2017 at 14:31
  • So, you should NOT parse a web page with regex. There are better, purpose-built tools for that. Commented Dec 14, 2017 at 14:33
  • My answer here (stackoverflow.com/a/37571199/2064981) might help you ;) Commented Dec 14, 2017 at 14:33
  • 2
    There is no way this regex will match anything as it matches only at the beginning of the string, because of '^stuff'. So it looks like you don't want to match at the very beginning of the webpage. Commented Dec 14, 2017 at 14:48
  • 2
    Your regex starts with the ^ anchor, which means it's only going to match a currency value that starts at the very beginning of the document. Commented Dec 14, 2017 at 14:48
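
The point made in the comments above can be checked with a small experiment. This is only a minimal sketch: the sample string text is invented for illustration and is not the real page content, and the pattern is a simplified stand-in for the one in the question.

```python
import re

# Hypothetical sample text; the real page content is not reproduced here.
text = "Shares rose after the firm reported $131bn in sales and £100bn in assets."

# '^' anchors the pattern to the start of the string, and the text
# does not begin with a money value, so nothing matches.
anchored = re.findall(r'^[$£€]\d+bn', text)

# Without the anchor, findall scans the whole string.
unanchored = re.findall(r'[$£€]\d+bn', text)

print(anchored)    # []
print(unanchored)  # ['$131bn', '£100bn']
```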

3 Answers

2

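Without knowing the text it has to search, you could use the regex:

([€$£]+[0-9a-zA-Z,.]+)

to capture every token that starts with €, £ or $ and continues with digits, letters, commas or periods. The match stops at the first other character (such as whitespace), so an attached suffix like bn is captured but a separate following word is not. Inside a character class, | is a literal character rather than alternation, so the symbols are simply listed one after another. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.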

Using this regex we get this code:

import re

webpage = '''
one 
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
# Note: this class lists only € and $; add £ to match pound amounts as well.
regex = r'([€$]+[0-9a-zA-Z,.]+)'
res = re.findall(regex, webpage)
print(res)

with the output:

['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']

EDIT: Running the same regex on the text of the provided website returns:

['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
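
One detail worth checking is how the character class behaves. The sketch below uses invented sample strings (not the real page) to show that | inside [...] is matched literally, and that a class without £ skips pound amounts, which is one possible reason £100bn is absent from the output above.

```python
import re

# Hypothetical sample strings for illustration only.
sample = "a|b cost $5 or £7"

# Inside [...] the '|' is an ordinary character, so this class
# also matches a literal pipe.
with_pipes = re.findall(r'[€|$|£]', sample)

# Listing the symbols alone is enough.
symbols_only = re.findall(r'[€$£]', sample)

# A class that lacks '£' never starts a match at a pound sign.
no_pound = re.findall(r'[€|$]+[0-9a-zA-Z,.]+', "£100bn and $100bn")

print(with_pipes)    # ['|', '$', '£']
print(symbols_only)  # ['$', '£']
print(no_pound)      # ['$100bn']
```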

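To also find values without a currency symbol, e.g. 500million, add 0-9 to the first character class, so that a match may start with either a currency symbol or a digit. Note that this also matches any other token of two or more characters that starts with a digit, such as a year.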

For example, this code:

import re

webpage = '''
one 
million
€1293.1205 million
500million
'''
regex = r'([€$0-9]+[0-9a-zA-Z,.]+)'
res = re.findall(regex, webpage)
print(res)

prints:

['€1293.1205', '500million']
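
For amounts such as 500mil dollars, where the unit is a separate word, one possible approach is to build the pattern from named pieces and to require either a currency symbol or a scale/unit word, so that bare numbers like years are not matched. This is a sketch under assumptions, not a complete money parser: the names NUMBER, SCALE, UNIT and PATTERN, the list of accepted words, and the sample text are all made up for illustration.

```python
import re

# Hypothetical building blocks; extend the word lists as needed.
NUMBER = r'\d+(?:,\d{3})*(?:\.\d+)?'      # 500, 1,000,000, 17.4
SCALE = r'(?:bn|billion|mil(?:lion)?)'    # bn, billion, mil, million
UNIT = r'(?:euros?|dollars?|pounds?)'     # euro(s), dollar(s), pound(s)

PATTERN = re.compile(
    # Symbol-led amounts, with an optional scale word: $17.4bn
    rf'[$£€]{NUMBER}(?:\s?{SCALE})?'
    # Amounts without a symbol must carry a scale or unit word.
    rf'|{NUMBER}\s?(?:{SCALE}(?:\s{UNIT})?|{UNIT})'
)

# Invented sample text, not the real page.
text = ("It paid 500mil dollars, then $17.4bn in 2017, "
        "and €1293.1205 million plus 500million more.")

found = PATTERN.findall(text)
print(found)
# ['500mil dollars', '$17.4bn', '€1293.1205 million', '500million']
```

All groups are non-capturing, so findall returns the full matched text, and the year 2017 is skipped because it has neither a symbol nor a unit.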

2 Comments

This works. If I wanted to find something such as 500mil dollars, how would I adapt your regex?
I have updated my answer with a potential solution to that.
0

The first error in your regex is the ^ at the beginning of the pattern. It anchors the match to the very start of the string, so findall can return at most one match, and only if the text begins with a money value.

You are also defining a lot of capturing groups, (...), which I assume you don't really need; when a pattern contains groups, findall returns the groups rather than the whole match. Make them all non-capturing (by adding ?: right after the opening parenthesis) and you will get very close to what you want:

regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
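
To illustrate the effect of the groups on findall, here is a small sketch using simplified patterns and invented sample text (neither is the exact regex from the question). It also shows one reason the result is only "very close": a decimal part written as \.[0-9][0-9] needs exactly two digits, so a value like $17.4bn is cut short.

```python
import re

# Hypothetical sample text for illustration only.
text = "It was worth $17.45bn, up from £100bn."

# With capturing groups, findall returns one tuple of groups per match.
capturing = re.findall(r'[$£€](\d+)(\.\d\d)?(bn)?', text)

# With non-capturing groups (?:...), findall returns the whole match.
non_capturing = re.findall(r'[$£€](?:\d+)(?:\.\d\d)?(?:bn)?', text)

# A two-digit decimal part does not fit "$17.4bn", so the match stops at "$17".
two_digits = re.findall(r'[$£€]\d+(?:\.\d\d)?(?:bn)?', "$17.4bn")

# Allowing one or more decimal digits picks up the full value.
any_digits = re.findall(r'[$£€]\d+(?:\.\d+)?(?:bn)?', "$17.4bn")

print(capturing)      # [('17', '.45', 'bn'), ('100', '', 'bn')]
print(non_capturing)  # ['$17.45bn', '£100bn']
print(two_digits)     # ['$17']
print(any_digits)     # ['$17.4bn']
```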


0

A webscraping solution:

import re
from urllib.request import urlopen  # Python 3; the original post used Python 2's urllib.urlopen

from bs4 import BeautifulSoup

html = urlopen('http://www.bbc.com/news/business-41779341').read()
s = BeautifulSoup(html, 'lxml')
# Search only the text of <p> elements and flatten the per-paragraph matches.
final_data = [m for p in s.find_all('p') for m in re.findall(r'[€$£][\w.]+', p.text)]
print(final_data)

Output when this answer was written:

['$131bn', '£100bn', '$100bn', '$17.4bn.']
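
The last value in the output above keeps its sentence-ending period, because the character class accepts a dot anywhere. One possible way to avoid that, sketched here on an invented sentence rather than the real page, is to allow a dot only when more word characters follow it.

```python
import re

# Hypothetical sample text for illustration only.
text = "Sales hit $17.4bn. Costs were £100bn, analysts said."

# A dot is allowed anywhere, so the trailing period is swallowed.
loose = re.findall(r'[€$£][\w.]+', text)

# A dot is allowed only when followed by at least one word character.
tight = re.findall(r'[€$£]\w+(?:\.\w+)*', text)

print(loose)  # ['$17.4bn.', '£100bn']
print(tight)  # ['$17.4bn', '£100bn']
```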
