
I have a large set of large files and a set of "phrases" that need to be replaced in each file.
The "business logic" imposes several restrictions:

  • Matching must be case-insensitive
  • Whitespace, tabs and newlines between the two words of a phrase must be handled by the regex (they cannot be ignored)

My solution (see below) is a bit on the slow side. How could it be optimised, both in terms of IO and string replacement?

import re

data = open("INPUT_FILE").read()
o = open("OUTPUT_FILE", "w")
for phrase in phrases:  # these are the set of phrases I am talking about
    b1, b2 = str(phrase).strip().split(" ")
    regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)
    data = regex.sub(b1 + "_" + b2, data)
o.write(data)
o.close()

UPDATE: 4x speed-up by converting all text to lower case and dropping re.IGNORECASE
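For reference, a minimal sketch of what that update amounts to (it assumes, as above, that phrases is already defined and that lower-cased output is acceptable):

import re

# Lower-case the file once so the per-phrase patterns can drop re.IGNORECASE;
# note this also lower-cases everything that is written back out.
data = open("INPUT_FILE").read().lower()
o = open("OUTPUT_FILE", "w")
for phrase in phrases:
    b1, b2 = str(phrase).strip().lower().split(" ")
    regex = re.compile(r"%s\ *\t*\n*%s" % (b1, b2))  # no IGNORECASE flag needed
    data = regex.sub(b1 + "_" + b2, data)
o.write(data)
o.close()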

3 Comments
  • wow. impressive change with/out case. Commented Sep 2, 2011 at 17:16
  • you might try re.compile(r"(%s)\s*(%s)"%(b1,b2)) and regex.sub(r"\1_\2", data) - no idea if it will be faster (see the sketch after these comments). Commented Sep 2, 2011 at 17:18
  • also, if you followed my advice about \s*, change it to (?: |\t|\n)* - that might avoid unicode issues, which are all i can think of to explain the case effect. Commented Sep 2, 2011 at 17:46
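Putting those two comments together, the suggested variant would look roughly like this (untested, same caveat as the comments; b1, b2 and data are as in the question):

import re

# Capture both words and reuse them via backreferences in the replacement.
regex = re.compile(r"(%s)\s*(%s)" % (b1, b2), re.IGNORECASE)
data = regex.sub(r"\1_\2", data)

# The narrower separator class from the follow-up comment, if \s* causes
# unexpected (e.g. unicode-related) behaviour:
regex = re.compile(r"(%s)(?: |\t|\n)*(%s)" % (b1, b2), re.IGNORECASE)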

2 Answers


you could avoid recompiling your regexp for every file:

import re

precompiled = []
for phrase in phrases:
    b1, b2 = str(phrase).strip().split(" ")
    precompiled.append((b1 + "_" + b2, re.compile(r"%s\ *\t*\n*%s" % (b1, b2), re.IGNORECASE)))

for (input, output) in ...:
    with open(output,"w") as o:
        with open(input) as i:
            data = i.read()
            for (pattern, regex) in precompiled:
                data = regex.sub(pattern, data)
            o.write(data)

it's the same for one file, but if you're repeating over many files then you are re-using the regexes.

disclaimer: untested, may contain typos.

[update] also, you can simplify the regexp a little by replacing the various space characters with \s*. i suspect you have a bug there, in that you would want to match " \t " and currently don't.
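A quick illustration of that suspected bug (a small check added here, not part of the original answer): the original pattern requires spaces, then tabs, then newlines in that fixed order, so it cannot match a space-tab-space separator, while \s* can:

import re

sample = "foo \t bar"                           # space, tab, space between the words
print(re.search(r"foo\ *\t*\n*bar", sample))    # None - the trailing space is never matched
print(re.search(r"foo\s*bar", sample))          # matches - \s covers spaces, tabs and newlines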


2 Comments

Sounds like a reasonable thing to do. In practice, the processing time per file went down from 10.5 to 9.9 seconds.
i realised why this doesn't help much - the re module caches compiled regexps, so repeatedly compiling the same pattern is already cheap.

You can do this in one pass by using a B-Tree data structure to store your phrases. This is the fastest way of doing it, with a time complexity of O(N log h), where N is the number of characters in your input file and h is the length of your longest word. However, Python does not offer an out-of-the-box B-Tree implementation.
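Since the standard library has no B-Tree, here is a rough sketch of the same one-pass idea using a plain dict-based trie as a stand-in (this is an illustration, not part of the original answer). It treats the single space inside each phrase as "any run of spaces/tabs/newlines", always takes the longest match, and - unlike the question's regex - requires at least one separator character between the two words:

def build_trie(phrases):
    # One path per phrase, lower-cased; "$" marks the end and stores the
    # replacement, so phrases are assumed not to contain a literal "$".
    root = {}
    for phrase in phrases:
        b1, b2 = str(phrase).strip().split(" ")
        node = root
        for ch in (b1.lower() + " " + b2.lower()):
            node = node.setdefault(ch, {})
        node["$"] = b1 + "_" + b2
    return root

def replace_one_pass(text, trie):
    out, i, n = [], 0, len(text)
    while i < n:
        node, j, last = trie, i, None
        while j < n:
            ch = text[j].lower()
            if ch in " \t\n" and " " in node:
                # a single space in the trie stands for any run of whitespace
                while j < n and text[j] in " \t\n":
                    j += 1
                node = node[" "]
            elif ch in node:
                node = node[ch]
                j += 1
            else:
                break
            if "$" in node:
                last = (j, node["$"])   # longest phrase ending here so far
        if last is not None:
            end, replacement = last
            out.append(replacement)
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

data = replace_one_pass(open("INPUT_FILE").read(), build_trie(phrases))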

You can also use a Hashtable (dictionary) and a replacement function to speed up things. This is easy to implement if the words you wish to replace are alphanumeric and single words only.

import re

replace_data = {}

# Populate the lookup table: lower-cased word -> replacement
for phrase in phrases:
    key, value = phrase.strip().split(' ')
    replace_data[key.lower()] = value

def replace_func(matchObj):
    # Look up the matched word; fall back to the original text untouched
    key = matchObj.group(0).lower()
    if key in replace_data:
        return replace_data[key]
    else:
        return matchObj.group(0)

# Original code flow
data = open("INPUT_FILE").read()
output = re.sub("[a-zA-Z0-9]+", replace_func, data)
o = open('OUTPUT_FILE', 'w')
o.write(output)
o.close()

2 Comments

I think it's not about replacing b1 with b2 but about replacing whitespace/tab/newline-separated combinations of b1 and b2 with an underscore-separated combination (see the sketch after these comments).
i'd love to see if this is faster. with python there's a trade-off between using the best algorithm and the fastest implementation. leaving everything as a regexp and tight loop may be a case of "worse is better" as it leaves more in c.
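To address that point while keeping the single-pass callback idea, one option (a sketch under the assumption that phrases and data are as in the question) is to join every pair into one big alternation and pick the replacement inside the callback:

import re

# One combined pattern: the file is scanned once for all phrases.
pairs = [str(p).strip().split(" ") for p in phrases]
combined = re.compile(
    "|".join(r"(%s)\s*(%s)" % (re.escape(b1), re.escape(b2)) for b1, b2 in pairs),
    re.IGNORECASE,
)

def join_pair(m):
    # Each alternative contributes two groups; only the matched pair is non-None.
    b1, b2 = [g for g in m.groups() if g is not None]
    return b1 + "_" + b2

data = combined.sub(join_pair, data)

With thousands of phrases the combined pattern gets large, so it is worth timing this against the original per-phrase loop before committing to it.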
