I'm using Python 3.3.0 on Windows 7.

I made this script to go through an HTTP proxy without authentication on a system. But when I execute it, it gives the error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6242-6243: character maps to <undefined>. It seems that it fails to decode Unicode characters into a string.

So, what should I use or change? Does anybody have a clue or a solution?

My .py file contains the following:

import sys, urllib
import urllib.request

url = "http://www.python.org"
proxies = {'http': 'http://199.91.174.6:3128/'}

opener = urllib.request.FancyURLopener(proxies)

try:
    f = urllib.request.urlopen(url)
except urllib.error.HTTPError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Error code: ",  e.code)
    sys.exit(1)
except urllib.error.URLError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Reason: ",  e.reason)
    sys.exit(1)

source = f.read()

if "iso-8859-1" in str(source):
    source = source.decode('iso-8859-1')
else:
    source = source.decode('utf-8')

print("\n SOURCE:\n",source)
  • You just published an IP of an open proxy. If this machine is yours I'd strongly suggest securing it properly. Commented Mar 3, 2013 at 18:06
  • Yeah, it's an open proxy. Please advise me more about this as well. Thanks. Commented Mar 7, 2013 at 4:00
  • If you are the owner of this proxy, or know the owner, use authentication. If you don't know who owns it, I would stop using it. Commented Mar 8, 2013 at 15:32

1 Answer

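  1. This code doesn't even use your proxy: the opener is created but never used, because the request is made with urllib.request.urlopen(url) instead of opener.open(url).
  2. This form of encoding detection is really weak. You should only look for the declared encoding in the well-defined locations: the HTTP header 'Content-Type' and, if the response is HTML, the charset meta tag.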
  3. As you didn't include a stack trace, I can't be sure where the error happened, but the line if "iso-8859-1" in str(source): is suspicious. In Python 3, str() on a bytes object without an encoding argument does not decode anything; it returns the repr of the bytes (b'...'). If you really want to keep this check (see point 2), you should do if b"iso-8859-1" in source: instead. This works on bytes instead of strings, so no decoding has to be done beforehand.
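
Points 1 and 2 can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper names (build_proxy_opener, charset_from_content_type, fetch_text) and the utf-8 fallback are my own assumptions, and the proxy URL is whatever you pass in.

```python
import urllib.request
from email.message import Message


def build_proxy_opener(proxy_url):
    """Return an opener that sends http:// requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


def fetch_text(url, proxy_url):
    """Fetch url through the proxy and decode it with the declared charset."""
    opener = build_proxy_opener(proxy_url)
    # The request has to go through opener.open(); a plain
    # urllib.request.urlopen() call ignores this opener.
    with opener.open(url, timeout=10) as response:
        header = response.headers.get("Content-Type", "")
        raw = response.read()
    return raw.decode(charset_from_content_type(header), errors="replace")
```

You would call it as fetch_text("http://www.python.org", your_proxy_url). The utf-8 default is only a guess for responses that declare no charset.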

Note: This code works fine for me, presumably because my system uses a default encoding of utf-8 while your Windows system uses something different.
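
Since the reported exception is an encode error, another possibility is that it is raised by print() when the console's encoding cannot represent a character in the page. The sketch below illustrates that situation. The console_safe helper is hypothetical, and cp437 is only an assumed stand-in for a legacy Windows console code page.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters that `encoding` cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


sample = "snowman: \u2603"

# cp437 stands in for a legacy console code page (an assumption).
try:
    sample.encode("cp437")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# After the round trip, every character is printable in that code page.
print(console_safe(sample, "cp437"))
```

If this is the cause, passing the decoded page through such a helper before printing, or writing it to a file opened with encoding="utf-8", should avoid the exception.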

Update: I recommend using python-requests when doing HTTP in Python.

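import requests

proxies = {'http': your_proxy_here}  # placeholder: fill in your proxy URL

# Session() takes no arguments; proxies are set on the session object.
with requests.Session() as sess:
    sess.proxies.update(proxies)
    r = sess.get('http://httpbin.org/ip')
    print(r.apparent_encoding)  # encoding guessed from the response body
    print(r.text)               # body decoded to str
    # more requests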

Note: requests doesn't use the encoding specified in the HTML; you would need an HTML parser like BeautifulSoup to extract that.
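
If you would rather avoid an extra dependency, the meta-tag lookup can be approximated with the standard library's html.parser instead of BeautifulSoup. This is a rough sketch under my own assumptions: the names MetaCharsetParser and sniff_meta_charset are made up, and only the first 2048 bytes are inspected.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Form: <meta http-equiv="Content-Type" content="...; charset=...">
            content = attrs.get("content") or ""
            for part in content.split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset" and value:
                    self.charset = value.strip("\"' ").lower()


def sniff_meta_charset(raw, default=None):
    """Look for a declared charset in the first bytes of an HTML document."""
    parser = MetaCharsetParser()
    # Meta declarations are ASCII-compatible, so a lossy ASCII decode of the
    # head of the document is enough to find them.
    parser.feed(raw[:2048].decode("ascii", errors="replace"))
    return parser.charset or default
```

With requests you could pass r.content to sniff_meta_charset and, if it returns a value, assign it to r.encoding before reading r.text.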


3 Comments

Sorry for the late reply; I was out of town. Thanks for the detailed answer. Please help me sort out all the points you have mentioned. Please give me some code or an example, so I can get a better idea.
My system also has utf-8 as its default encoding. Would you please tell me why this code does not use the proxy? I have seen this code in the Python documentation itself!
Hmm, I tried this: if b"iso-8859-1" in source: But it's also not working!
