
I have a large set of large files and a set of "phrases" that need to be replaced in each file.
The "business logic" imposes several restrictions:

  • Matching must be case-insensitive
  • Whitespace, tabs and newlines between the two words of a phrase must be handled by the regex (they cannot be ignored)

My solution (see below) is a bit on the slow side. How could it be optimised, both in terms of IO and string replacement?

import re

data = open("INPUT_FILE").read()
o = open("OUTPUT_FILE", "w")
for phrase in phrases:  # these are the set of phrases I am talking about
    b1, b2 = str(phrase).strip().split(" ")
    regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)
    data = regex.sub(b1 + "_" + b2, data)
o.write(data)
o.close()

UPDATE: 4x speed-up by converting all text to lower case and dropping re.IGNORECASE
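For reference, a minimal sketch of what that update amounts to (it assumes, as above, that phrases is already defined and that lower-cased output is acceptable):

import re

# Lower-case the file once so the per-phrase patterns can drop re.IGNORECASE;
# note this also lower-cases everything that is written back out.
data = open("INPUT_FILE").read().lower()
o = open("OUTPUT_FILE", "w")
for phrase in phrases:
    b1, b2 = str(phrase).strip().lower().split(" ")
    regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2))  # no IGNORECASE flag needed
    data = regex.sub(b1 + "_" + b2, data)
o.write(data)
o.close()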

3 Comments
  • wow. impressive change with/out case. Commented Sep 2, 2011 at 17:16
  • you might try re.compile(r"(%s)\s*(%s)"%(b1,b2)) and regex.sub(r"\1_\2", data) - no idea if it will be faster (see the sketch after these comments). Commented Sep 2, 2011 at 17:18
  • also, if you followed my advice about \s*, change it to (?: |\t|\n)* - that might avoid unicode issues, which are all i can think of to explain the case effect. Commented Sep 2, 2011 at 17:46
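Putting those two comments together, the suggested variant would look roughly like this (untested, same caveat as the comments; b1, b2 and data are as in the question):

import re

# Capture both words and reuse them via backreferences in the replacement.
regex = re.compile(r"(%s)\s*(%s)" % (b1, b2), re.IGNORECASE)
data = regex.sub(r"\1_\2", data)

# The narrower separator class from the follow-up comment, if \s* causes
# unexpected (e.g. unicode-related) behaviour:
regex = re.compile(r"(%s)(?: |\t|\n)*(%s)" % (b1, b2), re.IGNORECASE)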

2 Answers


you could avoid recompiling your regexp for every file:

import re

precompiled = []
for phrase in phrases:
    b1, b2 = str(phrase).strip().split(" ")
    precompiled.append((b1 + "_" + b2, re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)))

for (input, output) in ...:
    with open(output,"w") as o:
        with open(input) as i:
            data = i.read()
            for (pattern, regex) in precompiled:
                data = regex.sub(pattern, data)
            o.write(data)

it's the same for one file, but if you're repeating over many files then you are re-using the regexes.

disclaimer: untested, may contain typos.

[update] also, you can simplify the regexp a little by replacing the various space characters with \s*. i suspect you have a bug there, in that you would want to match " \t " and currently don't.
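A quick illustration of that suspected bug (a small check added here, not part of the original answer): the original pattern requires spaces, then tabs, then newlines in that fixed order, so it cannot match a space-tab-space separator, while \s* can:

import re

sample = "foo \t bar"                           # space, tab, space between the words
print(re.search(r"foo\ *\t*\n*bar", sample))    # None - the trailing space is never matched
print(re.search(r"foo\s*bar", sample))          # matches - \s covers spaces, tabs and newlines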


2 Comments

Sounds like a reasonable thing to do. In practice, the processing time per file went down from 10.5 to 9.9 seconds.
i realised why this doesn't help much - the re module caches compiled regexps, so repeatedly compiling the same pattern is already cheap.

You can do this in one pass by using a B-Tree data structure to store your phrases. This is the fastest way of doing it, with a time complexity of O(N log h), where N is the number of characters in your input file and h is the length of your longest word. However, Python does not offer an out-of-the-box B-Tree implementation.
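Since the standard library has no B-Tree, here is a rough sketch of the same one-pass idea using a plain dict-based trie as a stand-in (this is an illustration, not part of the original answer). It treats the single space inside each phrase as "any run of spaces/tabs/newlines", always takes the longest match, and - unlike the question's regex - requires at least one separator character between the two words:

def build_trie(phrases):
    # One path per phrase, lower-cased; "$" marks the end and stores the
    # replacement, so phrases are assumed not to contain a literal "$".
    root = {}
    for phrase in phrases:
        b1, b2 = str(phrase).strip().split(" ")
        node = root
        for ch in (b1.lower() + " " + b2.lower()):
            node = node.setdefault(ch, {})
        node["$"] = b1 + "_" + b2
    return root

def replace_one_pass(text, trie):
    out, i, n = [], 0, len(text)
    while i < n:
        node, j, last = trie, i, None
        while j < n:
            ch = text[j].lower()
            if ch in " \t\n" and " " in node:
                # a single space in the trie stands for any run of whitespace
                while j < n and text[j] in " \t\n":
                    j += 1
                node = node[" "]
            elif ch in node:
                node = node[ch]
                j += 1
            else:
                break
            if "$" in node:
                last = (j, node["$"])   # longest phrase ending here so far
        if last is not None:
            end, replacement = last
            out.append(replacement)
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

data = replace_one_pass(open("INPUT_FILE").read(), build_trie(phrases))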

You can also use a Hashtable (dictionary) and a replacement function to speed up things. This is easy to implement if the words you wish to replace are alphanumeric and single words only.

import re

replace_data = {}

# Populate the lookup table: lower-cased word -> replacement
for phrase in phrases:
    key, value = phrase.strip().split(' ')
    replace_data[key.lower()] = value

def replace_func(matchObj):
    # Look up the matched word; fall back to the original text untouched
    key = matchObj.group(0).lower()
    if key in replace_data:
        return replace_data[key]
    else:
        return matchObj.group(0)

# Original code flow
data = open("INPUT_FILE").read()
output = re.sub("[a-zA-Z0-9]+", replace_func, data)
o = open('OUTPUT_FILE', 'w')
o.write(output)
o.close()

2 Comments

I think it's not about replacing b1 with b2 but about replacing whitespace/tab/newline-separated combinations of b1 and b2 with an underscore-separated combination (see the sketch after these comments).
i'd love to see if this is faster. with python there's a trade-off between using the best algorithm and the fastest implementation. leaving everything as a regexp and tight loop may be a case of "worse is better" as it leaves more in c.
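To address that point while keeping the single-pass callback idea, one option (a sketch under the assumption that phrases and data are as in the question) is to join every pair into one big alternation and pick the replacement inside the callback:

import re

# One combined pattern: the file is scanned once for all phrases.
pairs = [str(p).strip().split(" ") for p in phrases]
combined = re.compile(
    "|".join(r"(%s)\s*(%s)" % (re.escape(b1), re.escape(b2)) for b1, b2 in pairs),
    re.IGNORECASE,
)

def join_pair(m):
    # Each alternative contributes two groups; only the matched pair is non-None.
    b1, b2 = [g for g in m.groups() if g is not None]
    return b1 + "_" + b2

data = combined.sub(join_pair, data)

With thousands of phrases the combined pattern gets large, so it is worth timing this against the original per-phrase loop before committing to it.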
