
I have about 500,000+ .txt files, roughly 7+ GB of data in total. I am using Python to load them into a SQLite database. I am creating two tables: the first holds the primary key and a hyperlink to the file. For the other table I am using an entity extractor that was developed in Perl by a coworker.

To accomplish this I am using subprocess.Popen(). Prior to this method I was opening a new Perl process on every iteration of my loop, but that was simply too expensive to be useful.

I need the Perl process to stay alive: I need to be able to send data back and forth to it, with the process not terminating until I tell it to do so. The Perl script was modified so that it accepts the full contents of a file on stdin and writes its results to stdout when it receives a \n. But I am having trouble reading the data...

If I use communicate(), my subprocess is terminated and I get an I/O error at the next iteration of my loop. If I try to use readline() or read(), it locks up. Here are some examples of the different behavior I am experiencing.

This deadlocks my system and I need to force-close Python to continue.

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read()
   f.close()

   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'

   numberExtractor.stdin.write(reportString)
   x = numberExtractor.stdout.read()        #I can not see the STDOUT, python freezes and does not run past here.

   print x

This cancels the subprocess and I get an I/O error at the next iteration of my loop.

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):

   f = open(infile)
   reportString = f.read()
   f.close()

   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'
   numberExtractor.stdin.write(reportString)
   x = numberExtractor.communicate()   #Works good, I can see my STDOUT from perl but the process terminates and will not run on the next iteration

   print x

If I just run it like this, it runs through all the code fine; the print line shows something like <open file '...', mode 'rb' at 0x015dbf08> for each item in my folder (the repr of the stdout file object, not its contents).

numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe","D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin= subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
   f = open(infile)
   reportString = f.read()
   f.close()

   reportString = reportString.replace('\n',' ')
   reportString = reportString.replace('\r',' ')
   reportString = reportString +'\n'

   numberExtractor.stdin.write(reportString)
   x = numberExtractor.stdout                #I can not get the value of the object, but it runs through all my files fine.

   print x

Hopefully I am making a simple mistake, but is there some way I can just send a file to my Perl process (stdin), get the stdout back, and then repeat without having to reopen the subprocess for every file in my loop?
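For reference, the usual way to keep such a request/response loop alive is to flush stdin after each write and then read back exactly one line with readline() (read() blocks because it waits for EOF, which only arrives when the child process exits). A sketch of the pattern, using a small Python child process as a stand-in for the Perl extractor (written in Python 3, hence the encode/decode; the child and its upper-casing are illustrative only, substitute your perl command line):

```python
import subprocess
import sys

# Stand-in for the Perl extractor: a child that echoes each input line
# back in upper case.  Substitute your perl command line here.
child_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)
proc = subprocess.Popen(
    [sys.executable, "-u", "-c", child_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

replies = []
for text in ["first report", "second report"]:   # stand-ins for file contents
    proc.stdin.write((text + "\n").encode())
    proc.stdin.flush()                 # push the line through the pipe now
    reply = proc.stdout.readline()     # read exactly one reply line back
    replies.append(reply.decode().rstrip("\n"))

proc.stdin.close()                     # signal EOF so the child can exit
proc.wait()
print(replies)                         # prints ['FIRST REPORT', 'SECOND REPORT']
```

The one hard requirement is that the child flushes its own stdout after each reply; if the Perl side buffers its output, the parent's readline() will still hang.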

  • Can the Perl program be easily translated to Python? Can this program be easily translated to Perl? Less complexity will help here. Commented Nov 30, 2010 at 17:57
  • That's not really an option in this case; that was my first thought before I even started down this road. Commented Nov 30, 2010 at 18:13

1 Answer


Consider using the shell. Life is simpler.

perl extractSerialNumbers.pl *.txt | python load_database.py

Don't mess around with having Python start perl and all that. Just read the results from perl and process those results in Python.

Since both processes run concurrently, this tends to be pretty fast and use a lot of CPU resources without much programming on your part.

In the Python program (load_database.py) you can simply use the fileinput module to read everything provided on stdin.

import fileinput, sqlite3
conn = sqlite3.connect('reports.db')   # your database file
for line in fileinput.input():
    # table/column names are placeholders for your schema
    conn.execute('INSERT INTO serials (serial) VALUES (?)', (line.strip(),))
conn.commit()

That's about all you need in the Python program if you make the shell do the dirty work of setting up the pipeline.
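With 500,000+ rows it is also worth batching the INSERTs with executemany and committing once, rather than executing and committing row by row. A sketch against an in-memory database (the serials table and the sample lines are placeholders for your schema and the extractor's output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # stand-in for the real database file
conn.execute("CREATE TABLE serials (serial TEXT)")  # placeholder schema

lines = ["SN001\n", "SN002\n", "SN003\n"]  # stand-in for the perl output
conn.executemany("INSERT INTO serials (serial) VALUES (?)",
                 [(l.strip(),) for l in lines])
conn.commit()                          # one commit for the whole batch

count = conn.execute("SELECT COUNT(*) FROM serials").fetchone()[0]
print(count)                           # prints 3
```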


11 Comments

+1 for recommending shell. But why use fileinput in this particular case instead of the simpler "for line in sys.stdin"?
@tokland: (1) it's not much simpler. (2) in the long run, handling pipe vs. < redirect vs. list of filenames is trivial with fileinput.
I am running this on a windows machine
I haven't ever used fileinput before... and am still confused. I'm reading the fileinput docs at docs.python.org/library/fileinput.html. I still don't see how this helps me send my data to the Perl. According to the documentation, "This module implements a helper class and functions to quickly write a loop over standard input or a list of files." But don't I still need to pipe this data to my Perl?
@dfami: Don't send stuff to perl from Python. Break things up so that it's a simple 1-direction pipeline. From something to perl to python in one simple direction.
