0

I am going crazy over this; I hope someone can help me.

I am trying to match this URL with a regex: https://www.reddit.com/r/spacex/?count=50&after=t3_xxxxxxx where each x is a number or letter.

The url is from an HTML file:

https://www.reddit.com/r/spacex/?count=25&after=t3_319905

I tried this:

re.search(r'(<a href=")(https://www.reddit.com/r/spacex/?count=25.+?)(")', subreddit).group(2)

but I keep getting: 'NoneType' object has no attribute 'group'.

  • I would recommend looking into a scraper like Beautiful Soup. Commented Apr 8, 2015 at 0:20
  • Yes, I know, but for this I want to use regex. I am trying to learn why my regex is not working. Commented Apr 8, 2015 at 0:22
  • You have plenty of characters in there that have special meanings in regular expressions; they need escaping. Commented Apr 8, 2015 at 0:24
  • First extract the URLs you are interested in (with Beautiful Soup, as recommended), using an XPath query to keep the URLs that begin with https://www.reddit.com/r/spacex/?count=25, and then extract the part of the URL you want with a regex (or a URL parser). Commented Apr 8, 2015 at 0:24
  • @user2369869 Hi there, I'm /u/EchoLogic, one of the mods of /r/SpaceX. What are you trying to accomplish? I may already have done whatever you're trying to do. Commented Apr 8, 2015 at 0:35

2 Answers

1

Use an HTML parser, like BeautifulSoup. It lets you pass a compiled regular expression to match against an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

import re
from bs4 import BeautifulSoup
import requests

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.content, "html.parser")

# Keep only the <a> tags whose href contains an "after=t3_..." token.
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))
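If installing BeautifulSoup is not an option, the same "parse first, then filter hrefs with a regex" idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one used above. The LinkCollector class and the sample_html string are made up for illustration; they are not the real Reddit markup.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Assumed sample markup, standing in for the downloaded page.
sample_html = '''
<a href="https://www.reddit.com/r/spacex/">top</a>
<a href="https://www.reddit.com/r/spacex/?count=25&amp;after=t3_319905">next</a>
'''

collector = LinkCollector()
collector.feed(sample_html)

# Same pattern as in the BeautifulSoup call: keep hrefs with an "after=t3_..." token.
pattern = re.compile(r"after=t3_\w+")
next_links = [href for href in collector.hrefs if pattern.search(href)]
print(next_links)
```

The parser decodes &amp; in attribute values, so the regex sees the plain URL rather than the escaped HTML source.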

Also see the must-provide link for regex+HTML questions:

0

? is a special character in regex that makes the preceding token optional. To match a literal ? character you need to escape it. You should also escape the dots, except the one in .+?, which is meant to match any character.

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |

Extra capturing groups are unnecessary here. Just a single capturing group would be enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
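A small sketch of why the original pattern returns None and how to guard against it. The subreddit string here is an assumed one-line sample, not the real page source. In the unescaped pattern, /? means "an optional slash", so the regex expects count to follow spacex or spacex/ directly and never matches the literal question mark.

```python
import re

# Assumed sample of the page source; the real `subreddit` holds the downloaded HTML.
subreddit = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905" rel="nofollow next">next</a>'

unescaped = r'<a href="(https://www.reddit.com/r/spacex/?count=25.+?)"'
escaped = r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"'

# re.search returns None when nothing matches, which is what caused the
# "'NoneType' object has no attribute 'group'" error.
no_match = re.search(unescaped, subreddit)
match = re.search(escaped, subreddit)

# Check the match object before calling .group() on it.
next_url = match.group(1) if match else None

print(no_match)
print(next_url)
```

Checking the result before calling .group() avoids the AttributeError on pages where the link is missing.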

3 Comments

What is the difference between .+? and .+, or similarly between .*? and .*?
.* matches any character (except line breaks) zero or more times. .+ matches any character one or more times. Also see this question to learn about lazy and greedy.
Yeah, but what makes it different when you add the ? after * or +?
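To illustrate that last question: a ? placed after * or + makes the quantifier lazy, so it matches as few characters as possible instead of as many as possible. A minimal sketch, using a made-up text string with two links:

```python
import re

# Made-up input with two quoted attribute values.
text = '<a href="first">one</a> <a href="second">two</a>'

# Greedy: .+ grabs as much as it can, then backs off to the LAST quote.
greedy = re.search(r'href="(.+)"', text).group(1)

# Lazy: .+? grabs as little as it can and stops at the FIRST quote.
lazy = re.search(r'href="(.+?)"', text).group(1)

print(greedy)
print(lazy)
```

This is why the pattern in the answer uses .+? before the closing quote: the greedy form could run past the end of the href and into later attributes or tags.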
