Regex matching specific HTML string with Python

Question

The pattern is as follows:

page_pattern = 'manual-data-link" href="(.*?)"'

The matching function is as follows, where pattern is one of the predefined patterns, such as page_pattern above:

def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)

    if match:
        return match.group(group_num)
    else:
        return None

The problem is that match is always None, even though I verified that the pattern works on http://pythex.org/. I suspect I'm not compiling or escaping the pattern correctly.

Test string

<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>

What are you trying to match? Do you have an example string, please?
– fdglefevre, Commented Mar 31, 2015 at 2:01
No, your escaped_pattern is manual\\-data\\-link\\"\\ href\\=\\"\\(\\.\\*\\?\\)\\". That is not the same as the original pattern.
– fdglefevre, Commented Mar 31, 2015 at 2:26

Community · Accepted Answer · 2017-05-23 12:06:20Z

4

You have three problems.

1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.

2) You should use re.search, not re.match. re.match only matches at the beginning of the string; you want to find a match anywhere inside the string.

3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
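Points 1 and 2 can be sketched as follows. This is a minimal rewrite of the question's get_pattern, assuming the pattern and test string given in the question; the name test_string is introduced here only for illustration.

```python
import re

# Pattern from the question, written as a raw string and left unescaped
# so that (.*?) keeps its meaning as a lazy capture group.
page_pattern = r'manual-data-link" href="(.*?)"'


def get_pattern(pattern, string, group_num=1):
    """Return the requested group of the first match anywhere in string, or None."""
    match = re.search(pattern, string)  # search scans the whole string
    return match.group(group_num) if match else None


# Test string from the question (the variable name is hypothetical).
test_string = (
    '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
    'data-id="20886" data-type-id="295636317" >Data</a>'
)

print(get_pattern(page_pattern, test_string))           # /data/123421
print(re.match(page_pattern, test_string))              # None: match only tries position 0
print(re.search(re.escape(page_pattern), test_string))  # None: "(.*?)" is now literal text
```

The last two lines reproduce the two failure modes separately: re.match fails because the string starts with "<a", and the escaped pattern fails because it looks for the literal characters "(.*?)" in the input.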

edited May 23, 2017 at 12:06

answered Mar 31, 2015 at 2:10

Robᵩ

Avinash Raj · Accepted Answer · 2015-03-31 02:20:47Z

3

re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle of the string, you need to use re.search instead of re.match.

>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'

Better still, use an HTML parser such as BeautifulSoup to parse HTML.

>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for i in soup.find_all('a', class_='manual-data-link'):
...     print(i['href'])
...
/data/123421

edited Mar 31, 2015 at 2:20

answered Mar 31, 2015 at 2:10

Avinash Raj
