Python regex HTML

Question

I am going crazy over this; I hope someone can help me.

I am trying to match this URL with a regex: https://www.reddit.com/r/spacex/?count=50&after=t3_xxxxxxx where each x is a number or a letter.

The URL comes from an HTML file:

https://www.reddit.com/r/spacex/?count=25&after=t3_319905

I tried this:

re.search(r'(<a href=")(https://www.reddit.com/r/spacex/?count=25.+?)(")', subreddit).group(2)

but I keep getting 'NoneType' object has no attribute 'group'.

I would recommend looking into a scraper like Beautiful Soup.
– TigerhawkT3, Commented Apr 8, 2015 at 0:20
Yes, I know, but for this I want to use regex. I am trying to learn why my regex is not working.
– BubbleTea, Commented Apr 8, 2015 at 0:22
You have plenty of characters in there that have special meanings in regular expressions; they need escaping.
– Julien Spronck, Commented Apr 8, 2015 at 0:24
First extract the URLs you are interested in (with Beautiful Soup, as recommended), using an XPath query to filter URLs that begin with https://www.reddit.com/r/spacex/?count=25, and then extract the part of the URL you want with a regex (or a URL parser).
– Casimir et Hippolyte, Commented Apr 8, 2015 at 0:24
@user2369869 Hi there, I'm /u/EchoLogic, one of the mods of /r/SpaceX. What are you trying to accomplish? I may already have done whatever you're trying to do.
– marked-down, Commented Apr 8, 2015 at 0:35
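
As a rough illustration of the "URL parser" route mentioned in the comments: once the href has been extracted, the standard library can split out the query parameters without any regex. This is only a sketch; the variable names are illustrative, and the URL is the one from the question.

```python
from urllib.parse import urlsplit, parse_qs

# URL taken from the question; in practice it would come from the extracted href.
url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"

# parse_qs maps each query parameter name to a list of its values.
query = parse_qs(urlsplit(url).query)

after = query["after"][0]       # the pagination token
count = int(query["count"][0])  # the count parameter as an integer

print(after, count)
```

This avoids having to think about escaping at all for the "which part of the URL do I want" step.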

Community · Accepted Answer · 2017-05-23 12:06:27Z

1

Use an HTML parser, such as BeautifulSoup. It lets you pass a compiled regular expression to match against an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

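import re

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.content, "html.parser")

# Raw string so that \w reaches the regex engine unchanged.
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))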

Also see the must-provide link for regex+HTML questions:

RegEx match open tags except XHTML self-contained tags

edited May 23, 2017 at 12:06

CommunityBot

answered Apr 8, 2015 at 0:27

alecxe

Avinash Raj · Accepted Answer · 2015-04-08 00:27:51Z

0

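? is a special character in regex that makes the preceding token optional, so /? in your pattern means "an optional slash" and the literal ? in the URL is never matched. To match a literal ? you need to escape it as \?. You should escape the dots as well, since an unescaped . matches any character, but not the one in .+?, which is meant as a wildcard.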

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |

Extra capturing groups are unnecessary here. Just a single capturing group would be enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
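
To see the difference concretely, here is a small self-contained sketch. The HTML fragment assigned to subreddit is made up for illustration (real Reddit markup will differ), and the check for None is an addition that avoids the AttributeError from the question when nothing matches.

```python
import re

# Hypothetical HTML fragment standing in for the `subreddit` string from the question.
subreddit = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905" rel="nofollow next">next</a>'

# Unescaped: "/?" means "an optional slash", so the literal "?" is never matched.
unescaped = r'<a href="(https://www.reddit.com/r/spacex/?count=25.+?)"'
# Escaped: "\?" and "\." match the literal characters.
escaped = r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"'

no_match = re.search(unescaped, subreddit)
match = re.search(escaped, subreddit)

# re.search returns None when nothing matches, so check before calling .group().
next_url = match.group(1) if match else None

print(no_match)
print(next_url)
```

The first pattern finds nothing, which is exactly where the 'NoneType' object has no attribute 'group' error comes from.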

edited Apr 8, 2015 at 0:27

answered Apr 8, 2015 at 0:26

Avinash Raj


3 Comments

BubbleTea Over a year ago

what is the difference between .+? and .+ or similarly .*? and .*

Avinash Raj Over a year ago

.* matches any character (except line breaks) zero or more times. .+ matches any character (except line breaks) one or more times. Also see this question to learn about lazy and greedy quantifiers.

BubbleTea Over a year ago

yeah but what makes it different when you add the ? after * or +
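
For what it's worth, a short sketch of what the trailing ? changes: without it the quantifier is greedy and takes as much as it can while still allowing a match; with it the quantifier is lazy and takes as little as it can. The sample string below is made up for illustration.

```python
import re

# Hypothetical one-line HTML containing two links.
html = '<a href="first">1</a> <a href="second">2</a>'

# Greedy: .+ runs to the end of the line, then backs off to the LAST closing quote.
greedy = re.search(r'href="(.+)"', html).group(1)

# Lazy: .+? stops at the FIRST closing quote it can.
lazy = re.search(r'href="(.+?)"', html).group(1)

print(greedy)  # first">1</a> <a href="second
print(lazy)    # first
```

That is why the pattern in the answer uses .+? before the closing quote: it keeps the match inside a single href attribute.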
