I'm trying to use re patterns within Scrapy to parse a string of the format below. I want to retrieve the numbers inside the font tags (e.g. 08:00). That is easy enough to do as a single list with (\d+:\d+)+, but I need two separate lists, one for AM and one for PM. Is the only way to do this to create two substrings, AM and PM, and then run the pattern against each of them? The (AM – and (PM – markers are unique. It feels like it should be possible to do this directly, but I'm out of ideas. Thanks.

example input:

(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>) 
  • Thank you for your replies. I'm afraid I wasn't clear enough in my original post. The string provided is a sample taken from a larger string that contains many other tags, including <br> tags, so splitting on tags in the way suggested isn't an option. As for BeautifulSoup, I haven't used it, so I think it may just be easier for me to use re to extract the two sections into substrings and parse them as indicated. Thanks again. Commented Apr 22, 2016 at 14:29

2 Answers

I would first eliminate the HTML tags and get the plain text to work with. For that, you can use an HTML parser, like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> data = '(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)'
>>> soup = BeautifulSoup(data, "html.parser")
>>> data = soup.get_text()
>>> AM, PM = data.split("  ")
>>> AM
u'(AM \u2013 07:00 08:00 09:00 10:100)'
>>> PM
u'(PM \u2013 18:00 190:00 175:00)'
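
As a rough follow-up sketch (not part of the original session), the two plain-text halves can then be scanned with re to build the separate lists. The strings below are copied from the output above so that the snippet runs without bs4; the names time_like, am_times and pm_times are made up for illustration. Note that once the tags are stripped, the leading 07:00 and 18:00 are picked up as well, since nothing distinguishes them from the values that were inside font tags.

```python
import re

# Plain-text halves as shown in the session above (tags already stripped).
AM = "(AM \u2013 07:00 08:00 09:00 10:100)"
PM = "(PM \u2013 18:00 190:00 175:00)"

# Any digits:digits token; the pattern is the one from the question.
time_like = re.compile(r"\d+:\d+")

am_times = time_like.findall(AM)
pm_times = time_like.findall(PM)

print(am_times)
print(pm_times)
```

If only the values that were wrapped in font tags are wanted, the first item of each list would need to be dropped, or the tags matched before they are stripped.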

2 Comments

Rather than calling get_text() on the whole input, why not split on the <br> tag itself?
@dimo414 that's a good point. I was just afraid of overcomplicating the problem for the OP, so I decided to show only a starting point that makes the data more convenient for the task. Thank you.

If your string is always going to look like the example then you can do this using the following regex:

import re

# Raw string so that \d is not read as a string escape; + avoids empty matches.
capture = re.compile(r"(?<=>)[\d:]+(?=<)")
res = capture.findall("(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)")
# Print each value found directly between a > and a <.
for match in res:
    print(match)

This won't work if other kinds of tags appear in the string, though, since it matches any run of digits and colons that sits directly between a > and a <, whichever tags those belong to.

Result:

08:00
09:00
10:100
190:00
175:00
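
To get the two separate lists the question asks for, one possible approach is to match each parenthesised section first and then search for font-tag values inside it. The following is only a sketch under a few assumptions: each section ends at the first closing parenthesis, no ")" occurs inside a section, and each of the AM and PM labels appears once, as the question states. The names section, font_value and times are invented for this example.

```python
import re

data = (
    "(AM \u2013 07:00 <font color=#0002fe>08:00</font> "
    "<font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) "
    "<br> (PM \u2013 18:00 <font color=#0000fe>190:00</font> "
    "<font color=#0000fe>175:00</font>)"
)

# Capture the label (AM or PM) and everything up to the next ")".
# Assumption: a section never contains ")" before its real end.
section = re.compile(r"\((AM|PM)\b([^)]*)\)")

# Inside a section, keep only values wrapped in <font ...>...</font>.
font_value = re.compile(r"<font[^>]*>\s*(\d+:\d+)\s*</font>")

# One list per label, keyed by "AM" / "PM".
times = {label: font_value.findall(body) for label, body in section.findall(data)}

print(times["AM"])
print(times["PM"])
```

This is still the two-step idea from the question (isolate each section, then run the pattern on it), just expressed with a second regex instead of manual substring slicing. Because the inner pattern requires font tags, the untagged 07:00 and 18:00 are skipped.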
