
Problem:

Replacing multiple string patterns in a large text file is taking a lot of time. (Python)

Scenario:

I have a large text file with no particular structure to it, but it contains several recurring patterns, for example email addresses and phone numbers.

There are over 100 different such patterns, and the file is about 10 MB in size (it could grow). The file may or may not contain all 100 patterns.

At present, I am replacing the matches using re.sub(), and the approach looks like this:

import gzip
import re

readfile = gzip.open(path, 'r')  # read the gzipped file
lines = readfile.readlines()     # load all lines into memory

linestr = ''
for line in lines:
    if len(line.strip()) != 0:   # skip empty lines
        linestr += line

for pattern in patterns:         # patterns holds (regex, replacement) pairs
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)  # helper that wraps re.compile
    linestr = re.sub(compiled_regex, replace, linestr)

This approach is taking a lot of time for large files. Is there a better way to optimize it?

I am thinking of replacing += with .join(), but I'm not sure how much that would help.
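Roughly what I have in mind (an untested sketch, reusing lines and patterns from the snippet above):

non_empty = [line for line in lines if len(line.strip()) != 0]
linestr = ''.join(non_empty)  # one join instead of += per line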

  • Do you have regex patterns to look for or simple strings? Commented Dec 16, 2016 at 21:57
  • If you have such a big file, you could also sort your data with a primary key once and then simply perform a binary search, which will greatly improve performance. It's a one-time trade-off and seems like a quick win for me. Also, at that size, use of a database should be considered. If you're dealing with a lot of data, applying a structure to it almost always yields a big improvement. Hence the reason that universities often teach data structures as a single course. Commented Dec 16, 2016 at 21:59
  • @Krazor: The question author says the file has no structure. So I'm wondering how you're thinking of sorting it? Commented Dec 16, 2016 at 22:02
  • Related: stackoverflow.com/questions/15175142/… Commented Dec 16, 2016 at 22:06
  • Excuse me then. You should definitely, as mentioned by @salah, consider the use of a generator! Commented Dec 16, 2016 at 22:37

2 Answers


You could use line_profiler to find which lines in your code take the most time:

pip install line_profiler    
kernprof -l run.py
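To be explicit about how that is usually wired up (a minimal sketch; run.py and replace_all are made-up names): decorate the function you want timed with @profile, which kernprof injects at run time, then run kernprof -l -v run.py to get a per-line timing report.

# run.py -- hypothetical script name
import gzip
import re

@profile  # injected by kernprof at run time, no import needed
def replace_all(path, patterns):
    with gzip.open(path, 'rt') as fp:
        text = ''.join(line for line in fp if line.strip())
    for regex, replacement in patterns:
        text = re.sub(regex, replacement, text)
    return text

The per-line output should make it obvious whether the re.sub loop or the file reading dominates.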

Another thing: I think you are building too large a string in memory. You could make use of generators instead.
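For example, something like this rough sketch (not the original code, and it assumes every pattern matches within a single line, which may not hold for your data):

import gzip
import re

def replaced_lines(path, patterns):
    # patterns: iterable of (regex_string, replacement) pairs
    compiled = [(re.compile(p), r) for p, r in patterns]
    with gzip.open(path, 'rt') as fp:
        for line in fp:
            if not line.strip():         # skip empty lines
                continue
            for regex, replacement in compiled:
                line = regex.sub(replacement, line)
            yield line

# consume lazily, e.g. stream straight to an output file instead of
# ever holding the whole text in memory:
# with open(out_path, 'w') as out:
#     out.writelines(replaced_lines(path, patterns))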


3 Comments

The use of a generator makes sense in this context. I don't understand how line_profiler will help, though.
@salah Thank you, I am not particularly familiar with generators. I will look into that. So, in your opinion, splitting the text file into chunks and performing the regex substitutions on each chunk would be more efficient?
I'm not entirely sure whether it'll be faster, but it will definitely be more efficient.

You may obtain slightly better results by doing:

import gzip
import re

large_list = []

with gzip.open(path, 'r') as fp:
    for line in fp:                  # iterate lazily instead of fp.readlines()
        if line.strip():
            large_list.append(line)

merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)
    merged_lines = compiled_regex.sub(replace, merged_lines)

However, further optimization can be achieved depending on what kind of processing you apply. In practice, the last line will be the one that takes up all the CPU time (and memory allocation). If the regexes can be applied on a per-line basis, you can achieve great results using the multiprocessing package; threading won't gain you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).
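A rough sketch of that per-line idea (the pattern list, file path and chunk size are made-up placeholders, and again it only works if no pattern spans lines):

import gzip
import re
from multiprocessing import Pool

PATTERNS = [(r'[\w.+-]+@[\w-]+\.[\w.-]+', '<EMAIL>')]   # placeholder patterns
COMPILED = []

def init_worker():
    # compile the patterns once per worker process
    global COMPILED
    COMPILED = [(re.compile(p), r) for p, r in PATTERNS]

def sub_line(line):
    for regex, replacement in COMPILED:
        line = regex.sub(replacement, line)
    return line

if __name__ == '__main__':
    with gzip.open('data.txt.gz', 'rt') as fp:          # hypothetical path
        lines = [line for line in fp if line.strip()]
    with Pool(initializer=init_worker) as pool:
        merged_lines = ''.join(pool.map(sub_line, lines, chunksize=1000))

pool.map preserves input order, so joining the results keeps the lines in their original order; whether this beats the single-process version depends on how heavy the patterns are compared to the cost of shipping lines between processes.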

3 Comments

I echo your thoughts on multiprocessing. And my scenario involves applying regexes both line by line and across lines (fixed structure).
Would it be too expensive to merge all the lines, perform the regex matches on merged_lines, and split it again later for the per-line matches? There could be multiple blocks of text that get replaced, which would reduce the amount of text left to analyze line by line.
You may test different variations - the next thing to do is identify the bottleneck (CPU -> try multiprocessing by splitting your source file or your workload; IO -> load everything into memory first).
