
I am writing a small program to fetch all hyperlinks from a webpage given its URL, but it seems the network I am on uses a proxy and the script is not able to fetch anything. My code:

import sys
import urllib
import urlparse

from bs4 import BeautifulSoup


def process(url):
    # Download the page and parse it with BeautifulSoup
    page = urllib.urlopen(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text)
    # Resolve every <a href> against the base URL and save it to s.txt
    with open('s.txt', 'w') as file:
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            print tag['href']
            file.write(tag['href'])
            file.write('\n')


def main():
    # Usage: python script.py <url> [<url> ...]
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)


if __name__ == '__main__':
    main()
  • Based on your question, your network may or may not have a proxy in use. Can you be a little more specific, or just ask your admins? Commented Sep 22, 2015 at 8:59
  • Yes, it has a proxy. I tried it at home and it worked fine, but when I took it to my department to show my teacher it didn't work. This is the error: IOError: [Errno socket error] [Errno -2] Name or service not known Commented Sep 22, 2015 at 11:05
  • This is the proxy I used to connect: "proxy4.nehu.ac.in:3128". How do I put it into my program? Please help, I am really stuck with it. Commented Sep 22, 2015 at 11:22
  • OK, I will check on this and come back to you if I run into a problem. I cannot test it right now because I have to try it at the university itself, since I don't have a proxied network to test on. Is that OK with you? Commented Sep 22, 2015 at 11:56
  • You can easily set up a proxy of your own; e.g. Squid is quite popular. Commented Sep 22, 2015 at 11:57

2 Answers


You could use the requests module instead.

import requests

proxies = {'http': 'http://host/'}
# or 'http://user:pass@host/' if the proxy requires authentication

r = requests.get(url, proxies=proxies)
text = r.text
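
As a usage sketch, here is roughly how the process() function from the question could be rewritten around requests; the proxy address is the one mentioned in the comments, and everything else mirrors the original code:

import urlparse

import requests
from bs4 import BeautifulSoup

# Proxy taken from the comments on the question; adjust if yours differs.
proxies = {'http': 'http://proxy4.nehu.ac.in:3128'}

def process(url):
    # Fetch the page through the proxy and collect absolute links into s.txt
    r = requests.get(url, proxies=proxies)
    soup = BeautifulSoup(r.text)
    with open('s.txt', 'w') as f:
        for tag in soup.findAll('a', href=True):
            link = urlparse.urljoin(url, tag['href'])
            print link
            f.write(link + '\n')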

7 Comments

Should I put it this way: proxies = { 'http': 'http://proxya4.nehu.ac.in }
You need the port and the closing quote, so it would be proxies = { 'http': 'http://proxya4.nehu.ac.in:3128' }
Can I come back to you later? I will try it first and let you know how it goes. I really want this to work; I'm like crying inside so bad.
Hi, I tried your suggestion and I get 'Response [200]' when I print r = requests.get("http://www.dota2.com", proxies=proxies). What does that mean?
200 is the HTTP status code of the response; it means the request was successful. To get the HTML of the page, print r.text. See w3.org/Protocols/rfc2616/rfc2616-sec10.html for the full list of status codes.
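
Putting those last two comments together, a tiny illustration (the URL and the proxy address are just the ones mentioned above):

import requests

proxies = {'http': 'http://proxy4.nehu.ac.in:3128'}  # proxy from the comments

r = requests.get('http://www.dota2.com', proxies=proxies)
print r.status_code  # 200 means the request was successful
print r.text         # the HTML of the page, ready to feed into BeautifulSoup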

The urllib library you are using for HTTP access does not support proxy authentication (it does support unauthenticated proxies). From the docs:

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

I suggest you switch to urllib2 and use it as demonstrated in the answer to this post.
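
The linked answer is not reproduced here, but a rough sketch of that urllib2 approach, using a ProxyHandler with the proxy address from the comments on the question, looks like this (the user:password form is only needed if the proxy authenticates):

import urllib2

# Proxy from the comments; for an authenticating proxy use
# 'http://user:password@proxy4.nehu.ac.in:3128' instead.
proxy = urllib2.ProxyHandler({'http': 'http://proxy4.nehu.ac.in:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # every urllib2.urlopen() call now goes through the proxy

page = urllib2.urlopen('http://www.example.com')
text = page.read()
page.close()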

2 Comments

I am new to Python, so it is hard for me to implement. Just to get me started, could you show me how I should put it into my program?
I have read in the Python documentation that urllib2 has a ProxyHandler that can handle proxies. How do I set it up so that my requests go through the proxy I use to connect to the internet? Please help.
