0

The pattern is as follows:

page_pattern = 'manual-data-link" href="(.*?)"'

The matching function is as follows, where pattern is one of the predefined patterns, such as page_pattern above:

def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)

    if match:
        return match.group(group_num)
    else:
        return None

The problem is that match is always None, even though I verified that the pattern works on http://pythex.org/. I suspect I'm not compiling or escaping the pattern correctly.

Test string:

<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>
  • What are you trying to match? Do you have an example string, please? Commented Mar 31, 2015 at 2:01
  • @FlorianLefèvre I've added a test string. Commented Mar 31, 2015 at 2:07
  • And what's your escaped_pattern? Commented Mar 31, 2015 at 2:08
  • @FlorianLefèvre It's the page_pattern above. Commented Mar 31, 2015 at 2:10
  • No, your escaped_pattern is manual\\-data\\-link\\"\\ href\\=\\"\\(\\.\\*\\?\\)\\". That is not the same. Commented Mar 31, 2015 at 2:26

2 Answers

4

You have three problems.

1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.

2) You should use re.search, not re.match. re.match only matches at the beginning of the string; you want to find a match anywhere inside the string.

3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
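As a minimal sketch of points 1 and 2, the asker's get_pattern could be rewritten roughly as below. It reuses page_pattern and the test string from the question; the name test_string and the raw-string prefix on the pattern are additions made here for illustration.

```python
import re

# Pattern from the question, written as a raw string (harmless here, but a good habit).
page_pattern = r'manual-data-link" href="(.*?)"'


def get_pattern(pattern, string, group_num=1):
    """Return the requested capture group of the first match, or None."""
    # No re.escape: the metacharacters in the pattern must keep their meaning.
    # re.search scans the whole string; re.match only tries position 0.
    match = re.search(pattern, string)
    return match.group(group_num) if match else None


# Test string from the question (the variable name is an assumption).
test_string = ('<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
               'data-id="20886" data-type-id="295636317" >Data</a>')

print(get_pattern(page_pattern, test_string))   # /data/123421
print(re.match(page_pattern, test_string))      # None: the string starts with "<a", not "manual-"
print(re.escape("(.*?)"))                       # \(\.\*\?\) : the group becomes literal text
```

Passing an escaped pattern to this function still returns None, because the escaped version looks for the literal characters (.*?) in the input.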

3

re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle of the string, you need to use re.search instead of re.match.

>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'

Better still, use an HTML parser such as BeautifulSoup to parse HTML files.

>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for i in soup.find_all('a', class_=re.compile('.*manual-data-link')):
...     print(i['href'])
...
/data/123421
