I need help parsing a text file that looks like this:

 WKU  03487472
 WKU 3487472
      Filed Apr. 30, 1968, Ser.  No. 725,329  
      Int.  Cl.  A42b 1122  
 AISD 19700106
 WKU  D2487471
 AISD 19700308
 WKU  03487471
      Filed J   16, 1969  
      [51] Int.  Cl.  A41d 25104  
 AISD 19700106

I would like to get output in CSV format, like this:

  WKU           Filed               Int.          AISD
 03487472    Apr. 30, 1968      A42b 1122      19700106
 D2487471          .                 .         19700308
 03487471      J   16, 1969      A41d 25104    19700106

I am not a programmer and have only just started using Python. I tried the following script:

        import csv
        import itertools

        def is_end_of_record(line):
           return line.startswith('WKU')

        class FieldClassifier(object):
           def __init__(self):
               self.field=''
           def __call__(self,row):
              if not row[0].isspace():
                  self.field=row.split(' ',1)[0]
              return self.field

        fields = 'WKU Filed Int. AISD'.split()
        with open('C:\Users\Na\Desktop\example.txt', 'r') as infile:
          with open('example.csv', 'wb') as oufile:
            writer = csv.DictWriter(oufile, fiels=fields) 
            writer.writerow(dict((h, h) for h in fields))
            for end_of_record, lines in itertools.groupby(infile,is_end_of_record):
               if not end_of_record:
                   classifier=FieldClassifier()
                   record={}
                   for fieldname, row in itertools.groupby(lines,classifier):
                        record[fieldname]='; '.join(r.strip() for r in row)

It does not seem to work properly. I would greatly appreciate any help or suggestions.

Thank you,

  • First of all, in the line writer = csv.DictWriter(oufile, fiels=fields), change fiels=fields to fieldnames=fields. Commented Feb 28, 2013 at 7:20

1 Answer

The format of your input file is not very strict. For such formats I think the re module is very useful. I created a regular expression with two groups for each kind of field: the first group is the key and the second is the value. I also dropped itertools:

import csv
import re

re_AISD = re.compile(r'(AISD)\s+(\S+)')
re_WKU = re.compile(r'(WKU)\s+(\S+)')
re_Filed = re.compile(r'(Filed)\s+(.*?\d{4})')
re_Int = re.compile(r'(Int\.)\s+Cl\.\s+(\w+ \d+)')

FLD_REGEXPES = (re_AISD, re_WKU, re_Filed, re_Int)

def get_field(line):
    for ree in FLD_REGEXPES:
        rx = ree.search(line)
        if rx:
            return (rx.group(1), rx.group(2))
    return (None, None)

def convert_file(fname):
    fields = 'WKU Filed Int. AISD'.split()
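    with open(fname, 'r') as f:
        lines = f.readlines()
    # 'wb' is the Python 2 convention for csv files; on Python 3 use
    # open(fname + '.csv', 'w', newline='') instead.
    with open(fname + '.csv', 'wb') as oufile: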
        writer = csv.DictWriter(oufile, fieldnames=fields, restval = '?', dialect='excel-tab')
        writer.writerow(dict((h, h) for h in fields))
        rec = {}
        for line in lines:
            k, v = get_field(line)
            if k:
                print('[%s]=[%s]' % (k, v))
                if k == 'WKU': # start of new record
                    if rec:
                        writer.writerow(rec)
                    rec = {}
                rec[k] = v
        if rec:
            writer.writerow(rec)
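
The code above opens the output file with 'wb', which is the Python 2 convention; on Python 3 the csv module needs a text-mode file instead. Below is a minimal sketch of the same approach for Python 3, not a drop-in replacement. The names parse_records and write_csv are hypothetical helpers introduced here, the sample text is a trimmed copy of the input from the question, and the dots in the Int. pattern are escaped. It writes to an in-memory buffer so it can be run without touching any files.

```python
import csv
import io
import re

# One pattern per field: group 1 is the key, group 2 is the value.
FIELD_PATTERNS = (
    re.compile(r'(AISD)\s+(\S+)'),
    re.compile(r'(WKU)\s+(\S+)'),
    re.compile(r'(Filed)\s+(.*?\d{4})'),
    re.compile(r'(Int\.)\s+Cl\.\s+(\w+ \d+)'),
)
FIELDS = ['WKU', 'Filed', 'Int.', 'AISD']


def parse_records(lines):
    """Group lines into dicts; every WKU line starts a new record."""
    records = []
    rec = {}
    for line in lines:
        for pattern in FIELD_PATTERNS:
            m = pattern.search(line)
            if m:
                key, value = m.group(1), m.group(2)
                if key == 'WKU' and rec:
                    records.append(rec)
                    rec = {}
                rec[key] = value
                break  # at most one field per line
    if rec:
        records.append(rec)
    return records


def write_csv(records, out):
    """Write records to an open text-mode file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval='.')
    writer.writeheader()
    writer.writerows(records)


sample = """\
 WKU  03487472
      Filed Apr. 30, 1968, Ser.  No. 725,329
      Int.  Cl.  A42b 1122
 AISD 19700106
 WKU  D2487471
 AISD 19700308
 WKU  03487471
      Filed J   16, 1969
      [51] Int.  Cl.  A41d 25104
 AISD 19700106
"""

records = parse_records(sample.splitlines())
buf = io.StringIO()
write_csv(records, buf)
print(buf.getvalue())
```

To write a real file on Python 3, pass write_csv a file opened with open('example.csv', 'w', newline=''). Values that contain a comma, such as the Filed dates, are quoted by the csv module automatically.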

Also note the path C:/Users/Na/Desktop/example.txt. In Python string literals the \ character is an escape character, used for sequences such as \n (newline) and \t (tab). In full path names you can write \\ or, better, use /, which works on both Windows and Unix. You can also use "raw" strings, which are prefixed with r; I used raw strings in re_AISD and the other regex definitions.
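
A small sketch of the three spellings described above, using the path from the question purely as an illustration (the file does not need to exist, since nothing is opened). The variable names are made up for this example.

```python
# Three ways to write the same Windows path in a Python string literal.
escaped = 'C:\\Users\\Na\\Desktop\\example.txt'  # each backslash doubled
raw = r'C:\Users\Na\Desktop\example.txt'         # raw string: backslashes kept as-is
forward = 'C:/Users/Na/Desktop/example.txt'      # forward slashes, no escaping needed

# The escaped and raw spellings produce the identical string.
print(escaped == raw)

# Why it matters: in a normal literal \t is one tab character,
# in a raw literal it stays as two characters (backslash and t).
print(len('a\tb'), len(r'a\tb'))
```

On Python 3 the unescaped form 'C:\Users\...' is not just risky but a syntax error, because \U starts a Unicode escape.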

5 Comments

The script works perfectly!! Thank you very much!! I have two further questions. First, I didn't get the output file unless I deleted the line "def convert_file(fname):". What is the role of this line, and do I need to keep or drop it? Second, the table seems to look better if I drop "restval = '?', dialect='excel-tab'". Could you also tell me what these arguments do? Thank you so much!
def convert_file(fname): declares a function in Python. Declaring it does not run it; you have to invoke it, for example with: convert_file('C:/Users/Na/Desktop/example.txt')
See the edited answer for how to use the Windows directory separator \ in strings.
For restval and dialect, see the documentation for the csv module. restval is the value written for any field that is missing from a record. dialect controls the field separator, escaping and similar things. You can also define your own dialect.
All your comments are very helpful. Thank you very much.
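
To make the effect of restval and dialect from the comments above concrete, here is a minimal Python 3 sketch that writes the same incomplete record twice, once with the defaults and once with the options from the answer. The render helper is a hypothetical name used only for this illustration, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

fields = ['WKU', 'Filed', 'Int.', 'AISD']
row = {'WKU': 'D2487471', 'AISD': '19700308'}  # no Filed or Int. value


def render(**writer_options):
    """Write the header and one row, returning the text produced."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, **writer_options)
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()


default_out = render()                              # comma-separated, blanks for missing fields
tab_out = render(restval='?', dialect='excel-tab')  # tab-separated, '?' for missing fields

print(default_out)
print(tab_out)
```

With the defaults the missing fields come out as empty cells separated by commas; with the answer's options they come out as ? separated by tabs, which is probably why the table looked different when those arguments were dropped.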
