I'm currently writing a parser that detects certain identifiers in a bunch of source files. The identifiers will subsequently be renamed by adding a prefix to make them unique. This renaming can only happen after all files are processed, because some identifiers are linked (e.g. a class and the instantiations of that class need to get the same class identifier).
One of the first steps in this process is temporarily cutting out all unnecessary code (string and parentheses content + line/block comments) to make the parsing easier and less time-consuming. These cut pieces are stored in a deque as namedtuples with the following structure: (index, value). After the parser and renamer have done their job, the pieces are pasted back into their original locations (with an offset to account for the file changes).
The previous code processes quickly, but a problem arises when I try to rebuild the files by inserting all trimmed pieces back into the file content:
while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    parsedContent = ''.join((parsedContent[:key], value, parsedContent[key:]))
Some files contain a large amount of strings/comments, making the rebuilding process very slow (over 6 minutes for 150,000 insertions). How can I make this insertion at an index more efficient?
Since strings are immutable, I have tried to achieve a performance gain by turning the string into a character list before doing all the insertions. This improves the speed of the while loop by about 10%. However, the subsequent join operation nullifies the gained advantage:
charList = list(parsedContent)
while self.trimmedCode:
    key, value = self.trimmedCode.pop()
    charList[key:key] = value
parsedContent = ''.join(charList)
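For reference, the two loops above can be compared on synthetic data. The functions and sizes below are made-up stand-ins (not the actual parser code), assuming the pieces are applied in descending index order. Both variants do O(n) work per insertion, which is why the total cost grows quadratically with the number of pieces:

```python
import timeit

def rebuild_by_join(text, pieces):
    # variant 1: one slice-and-join per insertion (copies the whole string each time)
    for key, value in reversed(pieces):
        text = ''.join((text[:key], value, text[key:]))
    return text

def rebuild_by_list(text, pieces):
    # variant 2: slice-assign into a character list, join once at the end
    # (the slice assignment still shifts all following elements)
    chars = list(text)
    for key, value in reversed(pieces):
        chars[key:key] = value
    return ''.join(chars)

if __name__ == '__main__':
    # made-up workload: 2,000 comment-like pieces in a 50,000-character file
    text = 'x' * 50000
    pieces = [(i * 10, '/*c*/') for i in range(2000)]
    for fn in (rebuild_by_join, rebuild_by_list):
        t = timeit.timeit(lambda: fn(text, pieces), number=1)
        print('%s: %.3f s' % (fn.__name__, t))
```

Doubling the number of pieces should roughly quadruple both timings, matching the slowdown observed on the larger files.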
My question: is there a more efficient way to do this task (using Python 2.7)?
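One possible direction, sketched under an assumption the question doesn't confirm: if the deque pops the pieces in descending index order (so no insertion shifts the index of a piece popped later), the string can be split once and all fragments joined at the end, making the rebuild O(total length) instead of O(N^2). The function name `rebuild` is hypothetical:

```python
from collections import deque

def rebuild(parsed, pieces):
    # One-pass rebuild: collect fragments back-to-front and join once,
    # instead of copying the whole string on every insertion.
    # Assumption: pieces pop in descending index order, with indexes
    # referring to positions in `parsed` itself.
    fragments = []
    end = len(parsed)
    while pieces:
        index, value = pieces.pop()
        fragments.append(parsed[index:end])  # text after this insertion point
        fragments.append(value)              # the piece being pasted back
        end = index
    fragments.append(parsed[:end])           # leading text before the first piece
    fragments.reverse()                      # built back-to-front, so flip
    return ''.join(fragments)
```

If the keys instead refer to the progressively rebuilt string (i.e. they already include the shift from earlier insertions), they would have to be converted back to positions in the original string before this works.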
EDIT: Relevant profiler stats:
Info: buildRenamedCopy rebuilds the file and contains the while loop, in which insertString does the join operation. This test was run on a collection of smaller files (~600 files).
ncalls tottime percall cumtime percall filename:lineno(function)
1284 9.998 0.008 137.834 0.107 file.py:146(buildRenamedCopy)
180923 59.810 0.000 110.459 0.001 file.py:142(insertString)
182213 50.652 0.000 50.657 0.000 {method 'join' of 'str' objects}
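Stats in this format come from cProfile/pstats; for reference, a minimal, self-contained way to collect them (rebuild_demo is a made-up stand-in workload, not the actual parser code):

```python
import cProfile
import pstats

def rebuild_demo():
    # stand-in workload: repeated slice-and-join insertions into a string
    text = 'x' * 10000
    for i in range(200):
        text = ''.join((text[:i], '/*c*/', text[i:]))
    return text

profiler = cProfile.Profile()
profiler.enable()
rebuild_demo()
profiler.disable()

# prints the same columns as above: ncalls, tottime, percall, cumtime, ...
pstats.Stats(profiler).sort_stats('cumulative').print_stats(5)
```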
Comments:

- Is trimmedCode being popped in a significant order? That is, do the key indexes take into account the shifting caused by inserting the previous values? I'm assuming so (or your existing code probably wouldn't give the correct output). Unfortunately, that makes it rather difficult to modify the algorithm to avoid O(N) list inserts (and O(N^2) overall running time).
- Where does join spend its time (memory allocation? memcpy? ...)? And where are the remaining 60s spent (110s for insertString vs. 50s for join)?