extract variables from text and write into csv with python

Question

I need help in parsing a text file which looks like this:

 WKU  03487472
 WKU 3487472
      Filed Apr. 30, 1968, Ser.  No. 725,329  
      Int.  Cl.  A42b 1122  
 AISD 19700106
 WKU  D2487471
 AISD 19700308
 WKU  03487471
      Filed J   16, 1969  
      [51] Int.  Cl.  A41d 25104  
 AISD 19700106

I would like to get some output in csv format as:

  WKU           Filed               Int.          AISD
 03487472    Apr. 30, 1968      A42b 1122      19700106
 D2487471          .                 .         19700308
 03487471      J   16, 1969      A41d 25104    19700106

I am not a programmer and have only just begun using Python. I tried the following script:

        import csv
        import itertools

        def is_end_of_record(line):
           return line.startswith('WKU')

        class FieldClassifier(object):
           def __init__(self):
               self.field=''
           def __call__(self,row):
              if not row[0].isspace():
                  self.field=row.split(' ',1)[0]
              return self.field

        fields = 'WKU Filed Int. AISD'.split()
        with open('C:\Users\Na\Desktop\example.txt', 'r') as infile:
          with open('example.csv', 'wb') as oufile:
            writer = csv.DictWriter(oufile, fiels=fields) 
            writer.writerow(dict((h, h) for h in fields))
            for end_of_record, lines in itertools.groupby(infile,is_end_of_record):
               if not end_of_record:
                   classifier=FieldClassifier()
                   record={}
                   for fieldname, row in itertools.groupby(lines,classifier):
                        record[fieldname]='; '.join(r.strip() for r in row)

It does not seem to work properly. I would greatly appreciate any help or suggestions.

Thank you,

First of all, in the line writer = csv.DictWriter(oufile, fiels=fields), change fiels=fields to fieldnames=fields.
– avasal, Commented Feb 28, 2013 at 7:20

Michał Niklas · Accepted Answer · 2013-03-01 06:15:22Z


The format of your input file is not very strict. For such formats I think the re module is very useful. I created a regular expression for each field, each with two groups: the first group captures the key and the second the value. I also dropped itertools:

import csv
import re

re_AISD = re.compile(r'(AISD)\s+(\S+)')
re_WKU = re.compile(r'(WKU)\s+(\S+)')
re_Filed = re.compile(r'(Filed)\s+(.*?\d{4})')
re_Int = re.compile(r'(Int\.)\s+Cl\.\s+(\w+ \d+)')  # dots escaped so they match a literal '.'

FLD_REGEXPES = (re_AISD, re_WKU, re_Filed, re_Int)

def get_field(line):
    for ree in FLD_REGEXPES:
        rx = ree.search(line)
        if rx:
            return (rx.group(1), rx.group(2))
    return (None, None)

def convert_file(fname):
    fields = 'WKU Filed Int. AISD'.split()
    with open(fname, 'r') as f:
        lines = f.readlines()
    # Python 3: open the CSV in text mode with newline='' (on Python 2 use 'wb' instead)
    with open(fname + '.csv', 'w', newline='') as oufile:
        writer = csv.DictWriter(oufile, fieldnames=fields, restval = '?', dialect='excel-tab')
        writer.writeheader()  # writes the header row from fieldnames
        rec = {}
        for line in lines:
            k, v = get_field(line)
            if k:
                print('[%s]=[%s]' % (k, v))
                if k == 'WKU': # start of new record
                    if rec:
                        writer.writerow(rec)
                    rec = {}
                rec[k] = v
        if rec:
            writer.writerow(rec)
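
To check what each pattern captures, the field lookup can be tried on a few lines from the question's sample. This is only a small sketch: it repeats the answer's patterns (with the literal dots escaped, which is my variation) and the `sample` list is just a hand-picked subset of the input shown in the question.

```python
import re

# Same patterns as in the answer; escaping the dots is an assumption of this sketch
patterns = [
    re.compile(r'(AISD)\s+(\S+)'),
    re.compile(r'(WKU)\s+(\S+)'),
    re.compile(r'(Filed)\s+(.*?\d{4})'),        # lazy match: stops at the first 4-digit year
    re.compile(r'(Int\.)\s+Cl\.\s+(\w+ \d+)'),
]

def get_field(line):
    """Return (key, value) for the first pattern that matches, else (None, None)."""
    for pattern in patterns:
        m = pattern.search(line)
        if m:
            return m.group(1), m.group(2)
    return None, None

# A few lines copied from the sample input in the question
sample = [
    ' WKU  03487472',
    '      Filed Apr. 30, 1968, Ser.  No. 725,329  ',
    '      [51] Int.  Cl.  A41d 25104  ',
    ' AISD 19700106',
]
results = [get_field(line) for line in sample]
for key, value in results:
    print('[%s]=[%s]' % (key, value))
```

Note that the Filed pattern keeps only the date and discards the ", Ser. No. ..." part, because the lazy `.*?` stops as soon as four consecutive digits are found.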

Also note the path C:/Users/Na/Desktop/example.txt. In Python string literals, the \ character is an escape character used for sequences such as \n (newline) and \t (tab). In full file paths you can write \\ or, better, use /, which works on both Windows and Unix. You can also use "raw" strings, which are prefixed with r; I used raw strings in re_AISD and the other regexp definitions.
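
As a quick illustration of the three spellings described above, here is a minimal sketch using the path from the question. The `C:\temp` string is a made-up example, added only to show how a backslash sequence can silently change a string.

```python
# Three spellings of the same Windows path (the path is the one from the question)
escaped = 'C:\\Users\\Na\\Desktop\\example.txt'   # backslashes doubled
raw = r'C:\Users\Na\Desktop\example.txt'          # raw string: backslashes kept literally
forward = 'C:/Users/Na/Desktop/example.txt'       # forward slashes

print(escaped == raw)                      # True
print(raw.replace('\\', '/') == forward)   # True

# Why single backslashes are risky: '\t' is read as one tab character
# ('C:\temp' is a hypothetical path used only for this demonstration)
print(len('C:\temp'), len(r'C:\temp'))     # 6 7
```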

edited Mar 1, 2013 at 6:15

answered Feb 28, 2013 at 9:00

Michał Niklas


5 Comments

NaNa Over a year ago

The script works perfectly!! Thank you very much!! I have two further questions. First, I didn't get the output file unless I deleted the line "def convert_file(fname):". What is the role of this line, and do I need to keep it or drop it? Second, the table seems to look better if I drop "restval = '?', dialect='excel-tab'". Could you also tell me what role these arguments play here? Thank you so much!

Michał Niklas Over a year ago

def convert_file(fname): declares a function in Python. Declaring it does not run it; the function has to be invoked, for example: convert_file('C:/Users/Na/Desktop/example.txt')
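
The difference between defining and calling can be seen in a tiny sketch. The function body here is a stand-in that only builds the output file name; it is not the real conversion code from the answer.

```python
def convert_file(fname):
    # Stand-in body for illustration only; the real function is in the answer
    return fname + '.csv'

# At this point nothing has run: 'def' only creates the function.
# The body executes when the function is called:
out_name = convert_file('C:/Users/Na/Desktop/example.txt')
print(out_name)   # C:/Users/Na/Desktop/example.txt.csv
```

So rather than deleting the def line, keep it and add the call on a new, unindented line at the end of the script.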

Michał Niklas Over a year ago

See the edited answer for information on how to use the Windows directory separator \ in strings.

Michał Niklas Over a year ago

For restval and dialect, see the documentation for the csv module. restval is the value written for any field that is missing from the record's dictionary. dialect controls the field separator, quoting, and similar details. You can also define your own dialect.
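
A small comparison may make the effect of the two arguments concrete. This sketch assumes Python 3 and writes to an in-memory buffer instead of a file; the `render` helper is hypothetical, introduced only for the demonstration, and the record is the second one from the question, which has no Filed or Int. value.

```python
import csv
import io

fields = ['WKU', 'Filed', 'Int.', 'AISD']
rec = {'WKU': 'D2487471', 'AISD': '19700308'}   # no 'Filed' or 'Int.' key

def render(**kwargs):
    """Write the header and one record to a string using the given DictWriter options."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, **kwargs)
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()

tabbed = render(restval='?', dialect='excel-tab')  # tab-separated, '?' for missing fields
default = render()                                 # comma-separated, empty for missing fields

print(tabbed)
print(default)
```

With the options from the answer, the data row is tab-separated with ? in the two missing columns; with the defaults it is comma-separated and the missing columns are left empty, which is what most spreadsheet programs expect from a .csv file.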

NaNa Over a year ago

All your comments are very helpful. Thank you very much.
