
I'm trying to improve the regex below:

urlpath=columns[4].strip()
urlpath=re.sub("(\?.*|\/[0-9a-f]{24})","",urlpath)
urlpath=re.sub("\/[0-9\/]*","/",urlpath)
urlpath=re.sub("\;.*","",urlpath)
urlpath=re.sub("\/",".",urlpath)
urlpath=re.sub("\.api","api",urlpath)
if urlpath in dlatency:

This transforms a URL like this:

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

to

api.v4.path.apiCallTwo

I would like to improve the regexes for performance: the script runs every 5 minutes across approximately 50,000 files and takes about 40 seconds overall.

Thank you.

5 Comments

  • Are you sure the regexes are the bottleneck in your script, and not, say, the hard disk? Commented Jun 5, 2012 at 15:04
  • Disk IO is fairly low. The script reads the log file in reverse, line by line, until it reaches a line that's over 5 minutes old. Commented Jun 5, 2012 at 15:18
  • Is this based on profiling the code or on intuition? Commented Jun 5, 2012 at 15:32
  • iostat -kxd 2 shows very minimal disk IO during the run of the script. Commented Jun 6, 2012 at 20:47
  • This specific case is about URLs, so as others answered, you can solve it with other tools. I suffered from this regex slowness issue: I waited more than 2 minutes for a substitution to end. I installed the regex package and it works fast and great; you can download it from pypi.python.org/pypi/regex. Commented Apr 30, 2015 at 14:33

6 Answers

2

Try this:

import re

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
re.sub(r'\?.+', '', s).replace('/', '.')[1:]
> 'api.v4.path.apiCallTwo'

For even better performance, compile the regular expression once and reuse it, like this:

regexp = re.compile(r'\?.+')
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# `s` changes, but you can reuse `regexp` as many times as needed
regexp.sub('', s).replace('/', '.')[1:]
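
For instance, a small sketch of reusing the compiled pattern over a couple of made-up log paths:

import re

regexp = re.compile(r'\?.+')

# Hypothetical sample paths, just to show the compiled pattern being reused
paths = [
    '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158',
    '/api/v4/path/apiCallOne?host=wApp',
]

for s in paths:
    print(regexp.sub('', s).replace('/', '.')[1:])
# api.v4.path.apiCallTwo
# api.v4.path.apiCallOne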

An even simpler approach, without using regular expressions:

s[1:s.index('?')].replace('/', '.')
> 'api.v4.path.apiCallTwo'

3 Comments

The second approach fails if there is no '?'. Why reinvent the wheel? ;)
@ms4py this is not about parsing URLs, it's about extracting and transforming text from a URL. Mind explaining the downvote?
It is about extracting the path of a URL and transforming it, and the preferable way is to parse it so that it works safely for every input.
2

One-liner with urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')
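
For example, a quick sketch with the example URL (using the Python 3 module name urllib.parse; in Python 2 it is the urlparse module as above):

from urllib.parse import urlsplit  # Python 3; in Python 2: import urlparse

url = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
print(urlsplit(url).path.strip('/').replace('/', '.'))
# api.v4.path.apiCallTwo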


2

Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")
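
For example, with the URL from the question:

s = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
print(s.partition("?")[0].strip("/").replace("/", "."))
# api.v4.path.apiCallTwo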

As some others have mentioned, the speed improvements here are negligible. Aside from using re.compile() to precompile your expressions, I would start looking elsewhere.

import re


re1 = re.compile("(\?.*|\/[0-9a-f]{24})")
re2 = re.compile("\/[0-9\/]*")
re3 = re.compile("\;.*")
re4 = re.compile("\/")
re5 = re.compile("\.api")
def orig_regex(urlpath):
    urlpath=re1.sub("",urlpath)
    urlpath=re2.sub("/",urlpath)
    urlpath=re3.sub("",urlpath)
    urlpath=re4.sub(".",urlpath)
    urlpath=re5.sub("api",urlpath)
    return urlpath


myregex = re.compile(r"([^/]+)")
def my_regex(urlpath):
    return ".".join( x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def test_func(func, iterations, *args, **kwargs):
    for i in xrange(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times
     60003 function calls in 0.108 CPU seconds

....

Iterating my_regex 10000 times
     130003 function calls in 0.087 CPU seconds

....

Iterating non_regex 10000 times
     40003 function calls in 0.019 CPU seconds

Without re.compile on your 5 regexes, the result is:

running <function orig_regex at 0x100532050> 10000 times
     210817 function calls (210794 primitive calls) in 0.208 CPU seconds


1

Going through the lines one by one:

You never use the captured group, so the ( and ) aren't needed, and / isn't a special character in Python's regex syntax, so it doesn't need to be escaped:

urlpath = re.sub("\?.*|/[0-9a-f]{24}", "", urlpath)

Replacing a / followed by zero repetitions of something with a / is pointless, so require at least one:

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Removing a fixed character and everything after it is faster using a string method:

urlpath = urlpath.partition(";")[0]

Replacing a fixed string with another fixed string is also faster using a string method:

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")


0

You can also compile re patterns to gain a performance boost, e.g.:

import re

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")


0

Are you sure you need a regex for this?
For example:

urlpath = columns[4].strip()
urlpath = urlpath.split("?")[0]
urlpath = urlpath.replace("/", ".")
