I was looking for a URL regexp in Python. After reading Stack Overflow, I decided to take this one: http://daringfireball.net/2010/07/improved_regex_for_matching_urls and use it in my Python code.

I've put in something like this:

reg_url =
re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.‌​][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))\*))+(?:(([^\s()<>]+|(‌​([^\s()<>]+)))\*)|[^\s`!()[]{};:`".,<>?«»“”‘’]))""",
re.DOTALL)

(Python 2.7)

When I run my code with that regexp, I get the following error:

SyntaxError: Non-ASCII character '\xe2' in file file.py on line 60, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

What is the best way to address this issue?

You have curly quotes and non-ASCII characters in there: «»“”‘’. Read the PEP. Commented Mar 29, 2013 at 2:56

1 Answer

Python 2 (unlike Python 3) assumes source files are ASCII unless told otherwise. Add a comment on the first or second line of your file along the lines of # encoding: utf-8, and you'll fix this issue. The PEP linked in your error message does a good job of explaining this.
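As a minimal sketch of what that looks like in practice (the names CURLY and has_curly_punctuation are made up for illustration, and this assumes the file itself is saved as UTF-8):

```python
# -*- coding: utf-8 -*-
# PEP 263: the declaration above is only honoured on line 1 or 2 of the file.
import re

# With the declaration in place, non-ASCII literals are legal in Python 2 source.
# The u prefix makes this a unicode string in Python 2; it is a no-op in Python 3.3+.
CURLY = re.compile(u"[«»“”‘’]")


def has_curly_punctuation(text):
    """Return True if text contains any of the non-ASCII punctuation marks."""
    return CURLY.search(text) is not None
```

Without the first line, Python 2 would likely stop with the same SyntaxError as in the question as soon as it reaches the first non-ASCII byte.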

However, it's worth noting that your regexp doesn't work for me, while the one copied directly from the site you link to, which looks very different, does. Have you considered using urlparse instead?
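If you only need to check whether a single string is a URL (rather than find URLs inside free text), something like this rough sketch might be enough. The helper name looks_like_url and the choice to accept only http and https are my assumptions, not anything from your code:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def looks_like_url(candidate):
    """Return True if candidate parses with an http(s) scheme and a host."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Note that urlparse does not search text for URLs the way the regex does, so this is only a replacement if you already have the candidate strings isolated.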

If you really do want to use a regex, note the following:

regex_a= re.compile(r"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")
regex_b = re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.‌​][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))\*))+(?:(([^\s()<>]+|(‌​([^\s()<>]+)))\*)|[^\s`!()[]{};:`".,<>?«»“”‘’]))""", re.DOTALL)

regex_a.match("http://www.www.com/thisisatest") # returns a match object
#regex_b.match("http://www.www.com/thisisatest") # edit: actually, this just hangs...

There appear to be a number of braces, parentheses, and brackets that have lost their escaping in your version, as well as U+200C and U+200B characters in odd places.
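Since those characters are invisible in most editors, a small helper can point them out. This is just a sketch: find_invisible and strip_invisible are hypothetical names, it only looks for the two code points mentioned above, and in Python 2 it expects a unicode string rather than raw bytes:

```python
# Zero-width space and zero-width non-joiner, written as escapes so this
# snippet itself stays pure ASCII.
ZERO_WIDTH = (u"\u200b", u"\u200c")


def find_invisible(text):
    """Return (index, 'U+XXXX') for each zero-width character in text."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text) if ch in ZERO_WIDTH]


def strip_invisible(text):
    """Return text with the zero-width characters removed."""
    for ch in ZERO_WIDTH:
        text = text.replace(ch, u"")
    return text
```

Running find_invisible over the pattern you pasted should show where the stray characters sit; stripping them will not restore the lost backslashes, though, so copying the regex again from the original page is probably still the safer route.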

3 Comments

I am getting the same error with both regex_a and regex_b; the encoding issue is still there.
Did you put # encoding: utf-8 as either the first or second line of your source code file? Note that it must be on the first or second line; it can't simply be anywhere.
Encoding was my issue. Thanks.