
I'm currently writing a parser that detects certain identifiers in a bunch of source files. The identifiers will subsequently be renamed by adding a prefix to make them unique. This renaming process can only happen after all files have been processed, because some of them are linked (e.g. a class and an instantiation of that class need to get the same class identifier).

One of the first steps in this process is temporarily cutting out all unnecessary code (string and parentheses content plus line/block comments) to make the parsing easier and less time-consuming. These cut pieces are stored in a deque as namedtuples with the structure (index, value). After the parser and renamer have done their job, the pieces are pasted back into their original locations (with an offset due to the file changes).
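For illustration, a minimal sketch of that bookkeeping (TrimmedPiece and the sample pieces are made up for this post; the real code uses its own names):

from collections import namedtuple, deque

# Hypothetical names for illustration only.
TrimmedPiece = namedtuple('TrimmedPiece', ('index', 'value'))

trimmedCode = deque()
trimmedCode.append(TrimmedPiece(6, '"a string literal"'))  # cut from index 6
trimmedCode.append(TrimmedPiece(40, '# a line comment'))   # cut from index 40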

The parsing itself runs quickly, but a problem arises when I try to rebuild the files by inserting all trimmed pieces back into the file content:

while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    parsedContent = ''.join((parsedContent[:key],value,parsedContent[key:]))

Some files contain a large number of strings/comments, making the rebuilding process very slow (over 6 minutes for 150,000 insertions). How can I make this insertion at an index more efficient?

Since strings are immutable, I have tried to achieve a performance gain by turning the string into a character list before doing all the insertions. This improves the speed of the while loop by about 10%. However, the subsequent join operation nullifies the gained advantage:

charList = list(parsedContent)

while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    charList[key:key] = value

parsedContent = ''.join(charList)

My question: is there a more efficient way to do this task (using Python 2.7)?

EDIT: Profiler stats

Relevant profiler stats:
Info: buildRenamedCopy rebuilds the file and contains the while loop, in which insertString does the join operation. This test was run on a collection of roughly 600 smaller files.

ncalls     tottime  percall  cumtime  percall filename:lineno(function)  
1284       9.998    0.008    137.834  0.107   file.py:146(buildRenamedCopy)  
180923     59.810   0.000    110.459  0.001   file.py:142(insertString)  
182213     50.652   0.000    50.657   0.000   {method 'join' of 'str' objects}  
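In case it matters, the table above is standard cProfile/pstats output; a typical way to produce it looks like this (renamer.processAll is a placeholder for the real entry point):

import cProfile
import pstats

cProfile.run('renamer.processAll()', 'rename.profile')  # placeholder entry point
pstats.Stats('rename.profile').sort_stats('cumulative').print_stats(10)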
  • Have you profiled your code in order to clearly identify the bottleneck(s)? Commented Aug 28, 2014 at 9:00
  • As a complement, if you really have huge string manipulations, maybe a rope implementation would be preferable to plain strings. I don't know if there is Python support available. Commented Aug 28, 2014 at 9:07
  • Question edited to include profiler stats. Commented Aug 28, 2014 at 9:44
  • Just to make sure I understand the problem correctly: are the items in trimmedCode being popped in a significant order? That is, do the key indexes take into account the shifting caused by inserting the previous values? I'm assuming so (or your existing code probably wouldn't give the correct output). Unfortunately, that makes it rather difficult to modify the algorithm to avoid O(N) list inserts (and O(N^2) overall running time). Commented Aug 28, 2014 at 10:47
  • Out of curiosity, does the profiler show where join spends its time (memory allocation? memcpy?...)? Where are the remaining 60s spent (110s for insertString, 50s for join)? Commented Aug 28, 2014 at 11:08

3 Answers


The reason your code is slow is that inserting into a list is O(N) (where N is the length of the list). That's because all of the values in the list after the insertion point need to be shifted in order to make room for the new values you're inserting.
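To see this cost concretely, here is a small timing sketch (illustrative only, not from the original answer; absolute numbers will vary by machine):

import timeit

for n in (10 ** 4, 10 ** 5, 10 ** 6):
    setup = "lst = ['x'] * %d" % n
    # Inserting at the front shifts all n items; appending shifts none.
    front = timeit.timeit("lst.insert(0, 'y'); lst.pop(0)", setup=setup, number=1000)
    back = timeit.timeit("lst.append('y'); lst.pop()", setup=setup, number=1000)
    print n, front, back   # front grows roughly linearly with n; back stays flat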

You can fix this by replacing the existing value at a given position with a new string (the inserted value plus the previous character) rather than doing a slice assignment which grows the list. This only works if each key is smaller than all the previous ones (that is, if you're iterating over the keys in descending order).

Here's a minimally modified version of your code which should have improved asymptotic performance (from O(N^2) to O(N)):

charList = list(parsedContent)

while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    charList[key] = value + charList[key] # the change is here! No O(N) slice assignment!

parsedContent = ''.join(charList)

Note that you can probably also replace the while and pop with a simpler and clearer for loop:

for key, value in reversed(self.trimmedCode):
    charList[key] = value + charList[key]

You might get even better performance if you used a generator function to produce the sequence of strings to be joined in larger chunks, rather than splitting the original string up into individual characters. This isn't an asymptotic performance change, but may give a large constant factor improvement. Here's an attempt at this:

def insert_gen(orig_string, insertions):
    # note: assumes `insertions` yields (key, value) pairs with ascending keys
    prev_key = 0
    for key, value in insertions:
        yield orig_string[prev_key:key] # yield text from the previous insert to this one
        yield value                     # yield the "inserted" text
        prev_key = key
    yield orig_string[prev_key:]        # yield trailing text (after the last insert)

You'd use it like this:

parsedContent = "".join(insert_gen(parsedContent, self.trimmedCode))
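For example, with some made-up data (the keys here are indices into the trimmed text, in ascending order):

trimmed = "def f(): pass"
pieces = [(6, "x, y"), (13, "  # a trailing comment")]

print ''.join(insert_gen(trimmed, pieces))
# -> def f(x, y): pass  # a trailing comment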

2 Comments

  • Oh damn, I had not noticed there was a quadratic performance bug there. Good catch.
  • Wow, I should have looked longer at the "charList" alternative. This offers a dramatic performance increase! Thank you for the clear answer.

You might get algorithmic gains by accumulating the pieces in a list and joining them in a single pass at the end.

Something along the lines of:

lastkey = 0
accumulator = []
while self.trimmedCode:
    key, value = self.trimmedCode.popleft()  # popleft: keys must come out in ascending order
    accumulator.extend((parsedContent[lastkey:key], value))
    lastkey = key
accumulator.append(parsedContent[lastkey:])
parsedContent = ''.join(accumulator)

It could be much faster than what you are doing now. For extra points, use a generator instead of an accumulator, as Blckknght suggests.

But if that is not fast enough, you should spend your time looking at Cython, or try an existing data structure that might be more efficient for this case; I would give gapbuffer a try.
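As a quick sanity check, here is a self-contained version of the same accumulator idea on made-up data (the deque contents are invented for this example):

from collections import deque

trimmed = "def f(): pass"
trimmedCode = deque([(6, "x, y"), (13, "  # body")])  # (index, value), ascending

lastkey = 0
accumulator = []
while trimmedCode:
    key, value = trimmedCode.popleft()
    accumulator.extend((trimmed[lastkey:key], value))
    lastkey = key
accumulator.append(trimmed[lastkey:])
print ''.join(accumulator)   # -> def f(x, y): pass  # body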

2 Comments

  • Hmm, we came up with almost exactly the same algorithm at the same time (though I used a generator rather than a list). I'm not sure a gapbuffer or rope would help if the inserts are all being done in one go (if there were going to be many inserts at different times into the same text, you'd probably be right).
  • Yeah. I thought about the generator, but I decided to omit it at first. Then I saw your answer.

You could gain some speed by using + instead of ''.join() to concatenate strings:

while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    parsedContent = parsedContent[:key] + value + parsedContent[key:]

1 Comment

  • This does not offer a notable speed increase. The + operation "reduces" the time of insertString from 110 to 109 seconds cumulative time.
