I'm trying to get all the links to the articles (which are marked with the class 'title may-blank'). I can't figure out why the code below prints a whole bunch of "href=" when I run it instead of the actual URLs. I also get a bunch of random text and links after the 25 failed article URLs (all 'href='), and I'm not sure why, since the loop should stop once it no longer finds the class 'title may-blank'. Can you help me find out what's wrong?

import urllib2

def get_page(page):

    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote] # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'

print_all_links(get_page(reddit_url))

2 Answers

Rawing is correct, but when I see an XY problem I prefer to show the best way to accomplish X rather than a way to fix Y. You should use an HTML parser such as BeautifulSoup to parse web pages:

from bs4 import BeautifulSoup
import urllib2

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])

If you are really allergic to HTML parsers, at least use a regex (though you should stick to HTML parsing):

import urllib2
import re

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
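
To see what that pattern captures without hitting the network, here is a minimal sketch that runs it on a hand-written snippet. The sample markup, the example.com URLs and the helper name extract_links are made up for illustration; the sketch assumes the class attribute carries the trailing space used in the pattern above.

```python
import re

# Hypothetical sample markup, modelled on the anchors described in the question.
SAMPLE_HTML = (
    '<p class="title"><a class="title may-blank " href="http://example.com/a">First</a></p>'
    '<p class="title"><a class="title may-blank " href="http://example.com/b">Second</a></p>'
    '<a class="comments" href="http://example.com/c">12 comments</a>'
)

# Same pattern as above: non-greedy capture up to the closing quote of href.
LINK_RE = re.compile(r'<a class="title may-blank " href="(.*?)"')


def extract_links(html):
    """Return the href of every anchor that matches the pattern literally."""
    return LINK_RE.findall(html)


for href in extract_links(SAMPLE_HTML):
    print(href)
```

Because the pattern is matched literally, any change in attribute order or spacing makes it miss links, which is one reason a parser is the safer choice.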

That's because the line

start_quote = page.find('"', start_link + 4)

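doesn't do what you think it does. start_link is set to the index of "title may-blank". So if you do a page.find at start_link+4, the search actually starts at "e may-blank". The first quote it finds is therefore the one that closes the class attribute, and the text between that quote and the next one is the "href=" you are seeing. If you change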

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

it'll work.
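
As a sketch of a variant that does not depend on the exact number of characters between the class name and its closing quote, the search can look for href=" after the marker instead. It also checks for -1, which str.find returns when nothing is found. That is a likely explanation for the random text after the 25 links: in the original loop a failed find gives start_link = -1, the quote search restarts near the beginning of the remaining page, and aurl is never empty. The names MARKER, HREF, get_all_links and SAMPLE_PAGE, and the sample markup itself, are made up for illustration.

```python
MARKER = 'title may-blank'
HREF = 'href="'


def get_next_target(page):
    """Return (url, end_position) for the next article link, or (None, -1)."""
    start_link = page.find(MARKER)
    if start_link == -1:  # str.find returns -1 when the marker is absent
        return None, -1
    # Look for the href attribute that follows the class marker.
    start_href = page.find(HREF, start_link + len(MARKER))
    if start_href == -1:
        return None, -1
    start_url = start_href + len(HREF)
    end_quote = page.find('"', start_url)
    if end_quote == -1:
        return None, -1
    return page[start_url:end_quote], end_quote


def get_all_links(page):
    """Collect every article URL, stopping once the marker is no longer found."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links


# Hypothetical sample page: two article links followed by unrelated markup.
SAMPLE_PAGE = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="title may-blank" href="http://example.com/b">B</a>'
    '<div id="footer"><a href="http://example.com/about">about</a></div>'
)

for link in get_all_links(SAMPLE_PAGE):
    print(link)
```

The sketch still assumes href comes after class inside the same tag, so it is only a patch for the string-search approach, not a replacement for a real parser.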
