
I'm trying to improve the regex below:

urlpath=columns[4].strip()
urlpath=re.sub("(\?.*|\/[0-9a-f]{24})","",urlpath)
urlpath=re.sub("\/[0-9\/]*","/",urlpath)
urlpath=re.sub("\;.*","",urlpath)
urlpath=re.sub("\/",".",urlpath)
urlpath=re.sub("\.api","api",urlpath)
if urlpath in dlatency:

This transforms a URL like this:

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

to

api.v4.path.apiCallTwo

I would like to improve the regexes for performance: the script runs every 5 minutes across approximately 50,000 files and takes about 40 seconds overall.

Thank you.

5 Comments

  • Are you sure the regexes are the bottleneck in your script, and not, say, the hard disk? Commented Jun 5, 2012 at 15:04
  • Disk IO is fairly low. The script reads the log file in reverse, line by line, until it reaches a line that's over 5 minutes old. Commented Jun 5, 2012 at 15:18
  • Is this based on profiling the code or on intuition? Commented Jun 5, 2012 at 15:32
  • iostat -kxd 2 shows very minimal disk IO during the run of the script. Commented Jun 6, 2012 at 20:47
  • This specific case is about URLs, so as others answered, you can solve it with other tools. I suffered from this regex slowness issue: I waited more than 2 minutes for a substitution to end. I installed the regex package and it works fast and great; you can download it from pypi.python.org/pypi/regex. Commented Apr 30, 2015 at 14:33

6 Answers

2

Try this:

import re

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
re.sub(r'\?.+', '', s).replace('/', '.')[1:]
> 'api.v4.path.apiCallTwo'

For even better performance, compile the regular expression once and reuse it, like this:

regexp = re.compile(r'\?.+')
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# `s` changes, but you can reuse `regexp` as many times as needed
regexp.sub('', s).replace('/', '.')[1:]
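
For instance, a small sketch of reusing the compiled pattern over a couple of made-up log paths:

import re

regexp = re.compile(r'\?.+')

# Hypothetical sample paths, just to show the compiled pattern being reused
paths = [
    '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158',
    '/api/v4/path/apiCallOne?host=wApp',
]

for s in paths:
    print(regexp.sub('', s).replace('/', '.')[1:])
# api.v4.path.apiCallTwo
# api.v4.path.apiCallOne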

An even simpler approach, without using regular expressions:

s[1:s.index('?')].replace('/', '.')
> 'api.v4.path.apiCallTwo'

3 Comments

The second approach fails if there is no '?'. Why reinvent the wheel? ;)
@ms4py this is not about parsing URLs, it's about extracting and transforming text from a URL. Mind explaining the downvote?
It is about extracting the path of a URL and transforming it, and the preferable way is to parse it so that it works safely for every input.
2

One-liner with urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')
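
For example, a quick sketch with the example URL (using the Python 3 module name urllib.parse; in Python 2 it is the urlparse module as above):

from urllib.parse import urlsplit  # Python 3; in Python 2: import urlparse

url = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
print(urlsplit(url).path.strip('/').replace('/', '.'))
# api.v4.path.apiCallTwo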


2

Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")
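
For example, with the URL from the question:

s = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
print(s.partition("?")[0].strip("/").replace("/", "."))
# api.v4.path.apiCallTwo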

As some others have mentioned, the speed improvements here are negligible. Aside from using re.compile() to precompile your expressions, I would start looking elsewhere.

import re


re1 = re.compile("(\?.*|\/[0-9a-f]{24})")
re2 = re.compile("\/[0-9\/]*")
re3 = re.compile("\;.*")
re4 = re.compile("\/")
re5 = re.compile("\.api")
def orig_regex(urlpath):
    urlpath=re1.sub("",urlpath)
    urlpath=re2.sub("/",urlpath)
    urlpath=re3.sub("",urlpath)
    urlpath=re4.sub(".",urlpath)
    urlpath=re5.sub("api",urlpath)
    return urlpath


myregex = re.compile(r"([^/]+)")
def my_regex(urlpath):
    return ".".join( x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def test_func(func, iterations, *args, **kwargs):
    for i in xrange(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times
     60003 function calls in 0.108 CPU seconds

....

Iterating my_regex 10000 times
     130003 function calls in 0.087 CPU seconds

....

Iterating non_regex 10000 times
     40003 function calls in 0.019 CPU seconds

Without re.compile on your 5 regexes, the result is:

running <function orig_regex at 0x100532050> 10000 times
     210817 function calls (210794 primitive calls) in 0.208 CPU seconds


1

Going through the lines one by one:

You never use the captured group, so the ( and ) aren't needed, and / isn't a special character in Python's regex syntax, so it doesn't need to be escaped:

urlpath = re.sub("\?.*|/[0-9a-f]{24}", "", urlpath)

Replacing a / followed by zero repetitions of something with a / is pointless, so require at least one:

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Removing a fixed character and everything after it is faster using a string method:

urlpath = urlpath.partition(";")[0]

Replacing a fixed string with another fixed string is also faster using a string method:

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")


0

You can also compile re patterns to gain a performance boost, e.g.:

import re

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")


0

Are you sure you need a regex for this?
For example:

urlpath = columns[4].strip()
urlpath = urlpath.split("?")[0]
urlpath = urlpath.replace("/", ".")
