Generating URLs in Python?

Question

I'm trying to get all the links to the articles (which have the class 'title may-blank'). I'm trying to figure out why the code below prints a whole bunch of "href=" when I run it instead of the actual URLs. I also get a bunch of random text and links after the 25 failed article URLs (all 'href='), and I'm not sure why, since it should stop once it stops finding the class 'title may-blank'. Can you help me find out what's wrong?

import urllib2

def get_page(page):

    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find ('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote] # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'

print_all_links(get_page(reddit_url))

Why don't you use something like BeautifulSoup (crummy.com/software/BeautifulSoup) to scrape the links?
– tttthomasssss, Commented Sep 2, 2014 at 8:05

Community · Accepted Answer · 2017-05-23 12:32:11Z

1

Rawing is correct, but when I face an XY problem I prefer to provide the best way to accomplish X instead of a way to fix Y. You should use an HTML parser like BeautifulSoup to parse webpages:

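from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    # Download the page and parse it with Python's built-in HTML parser
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html, 'html.parser')
    # A string as the second argument filters on the class attribute;
    # href=True keeps only anchors that actually have an href
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])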

If you are really allergic to HTML parsers, at least use a regex (though you should really stick to HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
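
If installing a third-party package is the obstacle, the standard library's own HTML parser can do similar filtering. What follows is only a sketch under a few assumptions: it targets Python 3 (where the module is html.parser), the names TitleLinkParser and extract_links are made up for illustration, and it runs on an inline sample string rather than the live page. It matches on the two class names individually, so extra whitespace inside the class attribute does not matter.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or valueless, hence the "or ''"
        classes = (attr_map.get('class') or '').split()
        href = attr_map.get('href')
        if href and 'title' in classes and 'may-blank' in classes:
            self.links.append(href)


def extract_links(html):
    """Return the hrefs of all matching anchors in an HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, not taken from the real page
sample = (
    '<p><a class="title may-blank " href="http://example.com/a">A</a></p>'
    '<p><a href="http://example.com/b" class="may-blank title">B</a></p>'
    '<p><a class="title" href="http://example.com/skip">skip</a></p>'
)
print(extract_links(sample))
```

Because the parser hands over attributes as name/value pairs, the order of class and href inside the tag makes no difference here, which is the main thing string searching and the regex above cannot promise.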

edited May 23, 2017 at 12:32

answered Sep 2, 2014 at 8:09

enrico.bacis

Aran-Fey · Accepted Answer · 2014-09-02 08:13:17Z

0

That's because the line

start_quote = page.find('"', start_link + 4)

doesn't do what you think it does. start_link is set to the index of "title may-blank". So, if you do a page.find at start_link+4, you actually start searching at "e may-blank". If you change

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

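it'll work, as long as the closing quote of the class attribute comes directly after title may-blank. If the attribute value has a trailing space (the regex in the other answer matches 'title may-blank '), the search starts exactly on that closing quote and you get href= again; using + 2 instead of + 1 covers both cases.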
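
A variant that avoids counting characters is to search for href=" after the class marker, and to check for -1 so the loop stops when the marker is no longer found. In the question's code, find returns -1 once the marker runs out, the next search then starts near the beginning of the remaining text, and that is where the trailing random text comes from. The sketch below keeps the question's get_next_target name, adds a hypothetical collect_links helper, and uses an inline sample string instead of the real page. It assumes href comes after class within the same tag.

```python
MARKER = 'title may-blank'


def get_next_target(page):
    """Return (url, end_index) for the next marked link, or (None, 0)."""
    start_link = page.find(MARKER)
    if start_link == -1:
        # No more marked links: signal the caller to stop
        return None, 0
    href_start = page.find('href="', start_link)
    if href_start == -1:
        return None, 0
    start_quote = href_start + len('href=')  # index of the opening quote
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None, 0
    return page[start_quote + 1:end_quote], end_quote


def collect_links(page):
    """Gather every marked URL, scanning the page left to right."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]  # continue after the URL just found
    return links


# Hypothetical sample markup, with and without the trailing space
sample = (
    '<p><a class="title may-blank " href="http://example.com/a">A</a></p>'
    '<p><a class="title may-blank" href="http://example.com/b">B</a></p>'
    '<p><a class="other" href="http://example.com/skip">skip</a></p>'
)
print(collect_links(sample))
```

This is still plain string searching, so it remains fragile against markup changes such as attributes appearing in a different order.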

answered Sep 2, 2014 at 8:13

Aran-Fey
