I have about 500,000+ txt file for about 7+ gigs of data total. I am using python to put them into a sqlite database. I am creating 2 tables, 1. is the pK and the hyperlink to the file. For the other table I am using an entity extractor that was devloped in perl by a coworker.
To accomplish this I am using subprocess.Popen(). T Prior to this method I was opening the perl at every iteration of my loop, but it was simply to expensive to be useful.
I need the perl to be dynamic, I need to be able to send data back and fourth from it and the process not terminate untilI tell it to do so. The perl was modified so it perl accepts the full string of a file as a stdin, and gives me a stdout when it gets a \n. But I am having trouble reading data...
If I use communicate, at the next iteration in my loop my subprocess is terminated, I get an I/O error. If I try and use readline() or read(), it locks up. Here are some examples of the differant behavior I am experiancing.
This deadlocks my system and I need to force close python to continue.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
f = open(infile)
reportString = f.read()
f.close()
reportString = reportString.replace('\n',' ')
reportString = reportString.replace('\r',' ')
reportString = reportString +'\n'
numberExtractor.stdin.write(reportString)
x = numberExtractor.stdout.read() #I can not see the STDOUT, python freezes and does not run past here.
print x
This cancels the subprocess and I get an I/O error at the next iteration of my loop.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
f = open(infile)
reportString = f.read()
f.close()
reportString = reportString.replace('\n',' ')
reportString = reportString.replace('\r',' ')
reportString = reportString +'\n'
numberExtractor.stdin.write(reportString)
x = numberExtractor.communicate() #Works good, I can see my STDOUT from perl but the process terminates and will not run on the next iteration
print x
If I just run it like this, It runs through all the code fine. the print line is ', mode 'rb' at 0x015dbf08> for each item in my folder.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
f = open(infile)
reportString = f.read()
f.close()
reportString = reportString.replace('\n',' ')
reportString = reportString.replace('\r',' ')
reportString = reportString +'\n'
numberExtractor.stdin.write(reportString)
x = numberExtractor.stdout #I can not get the value of the object, but it runs through all my files fine.
print x
Hopefully I am making a simple mistake, but is there some way I can just send a file to my perll (stdin), get the stdout, and then repeat without having to reopen my subprocess for every file in my loop?