Python regexp error

Question

I was looking for a URL regexp for Python. After reading Stack Overflow, I decided to take this one: http://daringfireball.net/2010/07/improved_regex_for_matching_urls and use it in my Python code.

I've put in something like this:

reg_url =
re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.‌][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))\*))+(?:(([^\s()<>]+|(‌([^\s()<>]+)))\*)|[^\s`!()[]{};:`".,<>?«»“”‘’]))""",
re.DOTALL)

(Python 2.7)

After running my code with that regexp, I get the following error:

SyntaxError: Non-ASCII character '\xe2' in file file.py on line 60, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

What is the best way to address this issue?

You have curly quotes and non-ASCII characters in there: «»“”‘’. Read the PEP.
– Blender, commented Mar 29, 2013 at 2:56

cge · Accepted Answer · 2013-03-29 03:52:33Z

1

Python 2 (unlike Python 3) assumes that source files are ASCII unless an encoding is declared. Add a comment on the first or second line of your file along the lines of # encoding: utf-8, and you'll fix this issue. The PEP linked in your error message does a good job of explaining this.
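
As a minimal sketch of what that looks like in a file (the name trailing_punct and the sample text are made up for illustration, and the u"" prefix is my own choice so the same snippet behaves the same on 2.7 and 3.3+):

```python
# encoding: utf-8
# The declaration above has to sit on line 1 or 2 of the file (PEP 263).
import re

# A character class holding the non-ASCII punctuation from the URL regex.
# The u"" prefix makes this a unicode pattern on Python 2.7 as well.
trailing_punct = re.compile(u"[«»“”‘’]")

# Hypothetical sample text wrapped in curly quotes.
print(trailing_punct.search(u"see “http://example.com”") is not None)  # True
```

Without the first line, Python 2.7 should refuse to compile the file with the same "Non-ASCII character" SyntaxError shown in the question.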

However, it's worth noting that your regexp doesn't work for me, while simply copying the one from the site you link to, which seems very different, does work. Have you considered the possibility of using urlparse?
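
If you only need to check whether a single string is a URL, rather than pull URLs out of free text, something along these lines may be enough. This is a rough sketch: looks_like_url is a hypothetical helper, and the rule "http(s) scheme plus a host" is my assumption about what you'd count as a URL.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def looks_like_url(text):
    """Return True if text parses with an http(s) scheme and a host part."""
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(looks_like_url("http://www.www.com/thisisatest"))  # True
print(looks_like_url("not a url"))                       # False
```

Note that this is stricter than the regex in one respect: a bare www.example.com has no scheme, so it is rejected here.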

If you really do want to use a regex, note the following:

import re

regex_a = re.compile(r"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")
regex_b = re.compile(r"""((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.‌][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))\*))+(?:(([^\s()<>]+|(‌([^\s()<>]+)))\*)|[^\s`!()[]{};:`".,<>?«»“”‘’]))""", re.DOTALL)

regex_a.match("http://www.www.com/thisisatest") # returns a match object
#regex_b.match("http://www.www.com/thisisatest") # edit: actually, this just hangs...

A number of braces, parentheses and brackets appear to have had their escaping removed in your version, and there are U+200C and U+200B characters in odd places.
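
Those invisible characters are easy to miss in an editor. One way to spot them, as a sketch (find_invisible_chars is a hypothetical helper, and the sample string is a shortened stand-in for your pattern, not the real thing):

```python
import unicodedata


def find_invisible_chars(text):
    """Return (index, codepoint, name) for each format-category (Cf) character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Shortened stand-in for the pasted pattern, with a zero-width non-joiner
# hidden inside the first character class.
sample = u"[.\u200c][a-z]{2,4}"
for hit in find_invisible_chars(sample):
    print(hit)
```

Both U+200C and U+200B fall in Unicode category Cf, so this check flags either one. Pass it the pattern as a unicode string (on 2.7, decode the pasted text first).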

edited Mar 29, 2013 at 3:52

answered Mar 29, 2013 at 3:35


3 Comments

user109074 Over a year ago

I am getting the same error with both regex_a and regex_b; the encoding issue is still there.

cge Over a year ago

Did you put # encoding: utf-8 as either the first or second line of your source code file? Note that it must be on the first or second line; it can't simply be anywhere.

user109074 Over a year ago

Encoding was my issue. Thanks.
