
Problem:

Replacing multiple string patterns in a large text file is taking a lot of time. (Python)

Scenario:

I have a large text file with no particular structure to it, but it contains several recurring patterns, for example email addresses and phone numbers.

There are over 100 different such patterns, and the file is about 10 MB in size (it could grow). The file may or may not contain all 100 patterns.

At present, I am replacing the matches using re.sub(), and the approach looks like this:

import gzip
import re

readfile = gzip.open(path, 'r')  # read the gzipped file
lines = readfile.readlines()     # load all lines into memory

linestr = ''
for line in lines:
    if len(line.strip()) != 0:   # skip empty lines
        linestr += line

for pattern in patterns:         # patterns holds (regex, replacement) pairs
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)  # helper that wraps re.compile
    linestr = re.sub(compiled_regex, replace, linestr)

This approach is taking a lot of time for large files. Is there a better way to optimize it?

I am thinking of replacing += with .join(), but I'm not sure how much that would help.
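Roughly what I have in mind (an untested sketch, reusing lines and patterns from the snippet above):

non_empty = [line for line in lines if len(line.strip()) != 0]
linestr = ''.join(non_empty)  # one join instead of += per line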

  • Do you have regex patterns to look for or simple strings? Commented Dec 16, 2016 at 21:57
  • If you have such a big file, you could also sort your data with a primary key once and then simply perform a binary search, which will greatly improve performance. It's a one-time trade-off and seems like a quick win for me. Also, at that size, use of a database should be considered. If you're dealing with a lot of data, applying a structure to it almost always yields a big improvement. Hence the reason that universities often teach data structures as a single course. Commented Dec 16, 2016 at 21:59
  • @Krazor: The question author says the file has no structure. So I'm wondering how you're thinking of sorting it? Commented Dec 16, 2016 at 22:02
  • Related: stackoverflow.com/questions/15175142/… Commented Dec 16, 2016 at 22:06
  • Excuse me then. You should definitely, as mentioned by @salah, consider the use of a generator! Commented Dec 16, 2016 at 22:37

2 Answers


You could use line_profiler to find which lines in your code take the most time:

pip install line_profiler    
kernprof -l run.py
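To be explicit about how that is usually wired up (a minimal sketch; run.py and replace_all are made-up names): decorate the function you want timed with @profile, which kernprof injects at run time, then run kernprof -l -v run.py to get a per-line timing report.

# run.py -- hypothetical script name
import gzip
import re

@profile  # injected by kernprof at run time, no import needed
def replace_all(path, patterns):
    with gzip.open(path, 'rt') as fp:
        text = ''.join(line for line in fp if line.strip())
    for regex, replacement in patterns:
        text = re.sub(regex, replacement, text)
    return text

The per-line output should make it obvious whether the re.sub loop or the file reading dominates.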

Another thing: I think you are building too large a string in memory. You could make use of generators instead.
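For example, something like this rough sketch (not the original code, and it assumes every pattern matches within a single line, which may not hold for your data):

import gzip
import re

def replaced_lines(path, patterns):
    # patterns: iterable of (regex_string, replacement) pairs
    compiled = [(re.compile(p), r) for p, r in patterns]
    with gzip.open(path, 'rt') as fp:
        for line in fp:
            if not line.strip():         # skip empty lines
                continue
            for regex, replacement in compiled:
                line = regex.sub(replacement, line)
            yield line

# consume lazily, e.g. stream straight to an output file instead of
# ever holding the whole text in memory:
# with open(out_path, 'w') as out:
#     out.writelines(replaced_lines(path, patterns))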


3 Comments

The use of a generator makes sense in this context. I don't understand how line_profiler will help, though.
@salah Thank you, I am not particularly familiar with generators. I will look into that. So, in your opinion, splitting the text file into chunks and performing the regex substitutions on each chunk would be more efficient?
I'm not entirely sure whether it'll be faster, but it will definitely be more efficient.

You may obtain slightly better results by doing:

import gzip
import re

large_list = []

with gzip.open(path, 'r') as fp:
    for line in fp:                  # iterate lazily instead of fp.readlines()
        if line.strip():
            large_list.append(line)

merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)
    merged_lines = compiled_regex.sub(replace, merged_lines)

However, further optimization can be achieved depending on what kind of processing you apply. In practice, the last line will be the one that takes up all the CPU time (and memory allocation). If the regexes can be applied on a per-line basis, you can achieve great results using the multiprocessing package; threading won't gain you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).
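A rough sketch of that per-line idea (the pattern list, file path and chunk size are made-up placeholders, and again it only works if no pattern spans lines):

import gzip
import re
from multiprocessing import Pool

PATTERNS = [(r'[\w.+-]+@[\w-]+\.[\w.-]+', '<EMAIL>')]   # placeholder patterns
COMPILED = []

def init_worker():
    # compile the patterns once per worker process
    global COMPILED
    COMPILED = [(re.compile(p), r) for p, r in PATTERNS]

def sub_line(line):
    for regex, replacement in COMPILED:
        line = regex.sub(replacement, line)
    return line

if __name__ == '__main__':
    with gzip.open('data.txt.gz', 'rt') as fp:          # hypothetical path
        lines = [line for line in fp if line.strip()]
    with Pool(initializer=init_worker) as pool:
        merged_lines = ''.join(pool.map(sub_line, lines, chunksize=1000))

pool.map preserves input order, so joining the results keeps the lines in their original order; whether this beats the single-process version depends on how heavy the patterns are compared to the cost of shipping lines between processes.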

3 Comments

I echo your thoughts on multiprocessing. And my scenario involves applying regexes both line by line and across lines (fixed structure).
Would it be too expensive to merge all the lines, perform the regex matches on merged_lines, and split it again later for the per-line matches? There could be multiple blocks of text that get replaced, which would reduce the amount of text left to analyze line by line.
You may test different variations - the next thing to do is identify the bottleneck (CPU -> try multiprocessing by splitting your source file or your workload; IO -> load everything into memory first).
