
I am trying to concatenate model output files. The model run was broken up into 5 parts, and each output file corresponds to one of those partial runs; due to the way the software writes its output, the timestep labelling restarts from 0 in each output file. I wrote some code to:

1) concatenate all the output files together
2) edit the merged file to re-label all timesteps, starting at 0 and increasing by an increment at each one.

The aim is that I can load this single file into my visualization software in one chunk, rather than open 5 different windows.

So far my code throws a memory error due to the large files I am dealing with.

I have a few ideas of how I could try to get rid of it, but I'm not sure which of them will work and/or which might slow things down to a crawl.

Code so far:

import os
import time

start_time = time.time()

# create new txt file in same folder as python script

open("domain.txt","w").close()


"""create concatenated document of all tecplot output files"""
#look into file number 1

for folder in range(1,6,1): 
    folder = str(folder)
    for name in os.listdir(folder):
        if "domain" in name:
            with open(folder+'/'+name) as file_content_list:
                start = ""
                for line in file_content_list:
                    start = start + line# + '\n' 
                with open('domain.txt','a') as f:
                    f.write(start)
              #  print start

#identify file with "domain" in name
#extract contents
#append to the end of the new document with "domain" in folder level above
#once completed, add 1 to the file number previously searched and do again
#keep going until no more files with a higher number exist

""" replace the old timesteps with new timesteps """
#open folder named domain.txt
#Look for lines:
##ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
##STRANDID=1, SOLUTIONTIME=0.000000000000e+00
# if they are found edits them, otherwise copy the line without alteration

with open("domain.txt", "r") as combined_output:
    start = ""
    start_timestep = 0
    time_increment = 3.154e10
    for line in combined_output:
        if "ZONE" in line:
            start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
            start_timestep = start_timestep + time_increment
        else:
            start = start + line

    with open('domain_final.txt','w') as f:
        f.write(start)

end_time = time.time()
print 'runtime : ', end_time-start_time

os.remove("domain.txt")

So far, I get the memory error at the concatenation stage.

To improve I could:

1) Try to do the corrections on the fly as I read each file, but since the code is already failing to get through a single file, I don't think that would make much of a difference other than to computing time

2) Load all the files into an array, turn the checks into a function, and run that function on the array (see the sketch after this list):

Something like:

def do_correction(line):
        if "ZONE" in line:
            return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
        else:
            return line

3) Keep it as is and have Python indicate when it is about to run out of memory, then write to the file at that stage. Does anyone know if that is possible?
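
One possible way to carry the timestep state in idea 2, since do_correction above has no access to start_timestep, could be a small closure. This is only an untested sketch; make_corrector is a made-up name, and the ZONE/STRANDID formats and increment are copied from my files above:

def make_corrector(time_increment=3.154e10):
    # returns a do_correction(line) that keeps its own running timestep
    # (sketch only -- make_corrector is a hypothetical helper, not part of my current code)
    state = {"t": 0.0}

    def do_correction(line):
        if "ZONE" in line:
            return ('ZONE T="' + str(state["t"]) +
                    's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL\n')
        elif "STRANDID" in line:
            out = 'STRANDID=1, SOLUTIONTIME=' + str(state["t"]) + '\n'
            state["t"] += time_increment   # advance once per zone
            return out
        return line

    return do_correction

It could then be applied line by line, e.g. do_correction = make_corrector() once, then outfile.write(do_correction(line)) inside the loop.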

Thank you for your help

Comments:

  • I don't get why you don't write the data on the fly instead of at the end. That is stream-compliant (note: your version must take ages because of the string concatenation). You have to rewrite your code completely.
  • Open the output file as well, before your for loop, and write the result for every line directly into the output file. Otherwise your "start" variable will blow up with very large files.
  • I thought that writing to a file was a costly operation, so I wanted to stack everything in one string and just write this string at the end with a single write() statement. From your comment I gather that it is faster to have an f.write() at each line.
  • Input and output are buffered by Python, so you don't need to care about that too much.
  • Would I also need to file_content_list.close() at the end of each "for line in file_content_list", or does it make no difference?
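
On the buffering point from those comments: writing line by line inside with blocks is fine, because Python buffers the writes internally, and the with blocks close the files automatically. A minimal sketch (the input name 1/domain_0001.dat is made up, and the larger buffer size is just an arbitrary example, not a requirement):

# illustration of line-by-line copying; Python buffers the writes,
# so each write() does not hit the disk individually
with open('1/domain_0001.dat') as src:                          # hypothetical input file
    with open('domain.txt', 'a', buffering=1024*1024) as dst:   # optional bigger buffer
        for line in src:
            dst.write(line)
# both files are closed automatically when the with blocks end,
# so no explicit close() is needed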

1 Answer


It is not necessary to read the entire contents of each file into memory before writing to the output file. Large files will just consume most, possibly all, of the available memory.

Simply read and write one line at a time. Also open the output file only once, and choose a name that will not itself be picked up and treated as an input file, otherwise you run the risk of concatenating the output file onto itself (not yet a problem, but it could be if you also process files from the current directory), assuming loading it doesn't already consume all the memory.

import os.path

with open('output.txt', 'w') as outfile:
    for folder in range(1, 6):                      # folders are named 1 .. 5
        for name in os.listdir(str(folder)):        # listdir needs a path string, not an int
            if "domain" in name:
                with open(os.path.join(str(folder), name)) as file_content_list:
                    for line in file_content_list:
                        # perform corrections/modifications to line here
                        outfile.write(line)

Now you can process the data in a line-oriented manner - just modify each line before writing it to the output file.
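
For example, folding the question's ZONE/STRANDID relabelling into that loop might look roughly like the sketch below. The N/E values and the 3.154e10 increment are simply copied from the question, and this is untested against real data:

import os.path

start_timestep = 0
time_increment = 3.154e10

with open('output.txt', 'w') as outfile:
    for folder in range(1, 6):
        for name in os.listdir(str(folder)):
            if "domain" in name:
                with open(os.path.join(str(folder), name)) as infile:
                    for line in infile:
                        if "ZONE" in line:
                            # rewrite the zone header with the new timestep label
                            line = ('ZONE T="' + str(start_timestep) +
                                    's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL\n')
                        elif "STRANDID" in line:
                            # rewrite the solution time, then advance the counter
                            line = ('STRANDID=1, SOLUTIONTIME=' +
                                    str(start_timestep) + '\n')
                            start_timestep += time_increment
                        outfile.write(line)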

