Python re string parsing

Question

I'm trying to use re patterns within Scrapy to parse a string in the format below. I want to retrieve the times inside the font tags (e.g. 08:00). Getting them all in one list is easy enough with (\d+:\d+)+, but I need two separate lists, one for AM and one for PM. Is the only way to do this to create two substrings, AM and PM, and then run the pattern against each of them? The "(AM -" and "(PM -" markers each appear only once. It feels like it should be possible to do this directly, but I'm out of ideas. Thanks.

example input:

(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)

Thank you for your replies. I'm afraid I wasn't clear enough in my original post. The string provided is a sample, but it is part of a larger string that contains many other tags, including <br> tags, so splitting on tags in the way suggested isn't an option. Regarding BeautifulSoup, I haven't used it, so I think it may be easier for me to use re to extract the two sections into substrings and parse them as indicated. Thanks again.
– john, Commented Apr 22, 2016 at 14:29

alecxe · Accepted Answer · 2016-04-22 12:57:08Z

3

I would first eliminate the HTML tags and get the plain text to work with. For that, you can use an HTML parser, like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> data = '(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)'
>>> soup = BeautifulSoup(data, "html.parser")
>>> data = soup.get_text()
>>> AM, PM = data.split("  ")
>>> AM
u'(AM \u2013 07:00 08:00 09:00 10:100)'
>>> PM
u'(PM \u2013 18:00 190:00 175:00)'
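From there, each plain-text half can be scanned on its own. The following is a minimal sketch, assuming the AM and PM values shown above (hard-coded here so the snippet runs by itself); am_times and pm_times are illustrative names. Note that get_text() discards the font tags, so the untagged leading time in each section (07:00, 18:00) is matched as well.

```python
import re

# Plain-text halves as produced by get_text() and split() above.
AM = "(AM \u2013 07:00 08:00 09:00 10:100)"
PM = "(PM \u2013 18:00 190:00 175:00)"

# Digits, a colon, digits. The sample contains odd values such as 10:100,
# so the digit counts are deliberately left unrestricted.
time_pattern = re.compile(r"\d+:\d+")

am_times = time_pattern.findall(AM)
pm_times = time_pattern.findall(PM)

print(am_times)
print(pm_times)
```

If only the tagged times are wanted, the first entry of each list can be dropped with am_times[1:] and pm_times[1:], assuming exactly one untagged time leads each section as in the sample.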


2 Comments

dimo414 Over a year ago

Rather than calling get_text() on the whole input, why not split on the <br> tag itself?

alecxe Over a year ago

@dimo414 that's a good point. I was just afraid of overcomplicating the problem for the OP and decided to show only a starting point that makes the data more convenient for the task. Thank you.
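The split suggested in the comments could be sketched roughly as follows. It assumes exactly one <br> separates the two sections, which holds for the sample but, per the asker's follow-up, not for the full page; am_html and pm_html are illustrative names.

```python
import re

data = (
    "(AM – 07:00 <font color=#0002fe>08:00</font> "
    "<font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) "
    "<br> (PM – 18:00 <font color=#0000fe>190:00</font> "
    "<font color=#0000fe>175:00</font>)"
)

# Split on the <br> tag (also tolerating <br/> and <br />),
# trimming the whitespace around it.
parts = re.split(r"\s*<br\s*/?>\s*", data)

# Unpacking like this only works when exactly one <br> is present.
am_html, pm_html = parts

print(am_html)
print(pm_html)
```

Because the font tags are still present in each half, a tag-aware pattern can then be applied to each half separately.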

Trev Davies · Accepted Answer · 2016-04-22 13:10:39Z

1

If your string is always going to look like the example then you can do this using the following regex:

import re

# Match a run of digits and colons sitting directly between ">" and "<".
# The raw string avoids invalid-escape warnings; "+" rules out empty matches.
capture = re.compile(r"(?<=>)[\d:]+(?=<)")
res = capture.findall("(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)")
for match in res:
    print(match)

This won't work if other kinds of tags in the string also wrap digits and colons, though, because the pattern matches any run of digits and colons that sits directly between > and <, whichever tags those belong to.

Result:

08:00
09:00
10:100
190:00
175:00
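To get separate AM and PM lists in one pass, one option is to capture each parenthesised section by its label first and then apply the same lookaround idea inside it. This is a sketch rather than part of the original answer: it assumes each section starts with "(AM" or "(PM" and contains no ")" before its closing parenthesis. The names section, tagged_time and times are illustrative.

```python
import re

data = (
    "(AM – 07:00 <font color=#0002fe>08:00</font> "
    "<font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) "
    "<br> (PM – 18:00 <font color=#0000fe>190:00</font> "
    "<font color=#0000fe>175:00</font>)"
)

# Group 1 is the label (AM or PM); group 2 is everything up to the
# section's closing parenthesis.
section = re.compile(r"\((AM|PM)\b([^)]*)\)")

# A time that sits directly between ">" and "<", i.e. inside a tag pair.
tagged_time = re.compile(r"(?<=>)\d+:\d+(?=<)")

# Map each label to the tagged times found inside its section.
times = {label: tagged_time.findall(body) for label, body in section.findall(data)}

print(times["AM"])
print(times["PM"])
```

The untagged leading times (07:00 and 18:00) are skipped because they are not preceded by ">". In a larger page, other tags inside a section that wrap a time would be picked up too, so the second pattern may need to name the font tag explicitly.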

