File processing in Python

Question

I'm working on text file processing using Python. I've got a text file (ctl_Files.txt) with content like the following:

------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  add $/Systems/DB/Expences/Loader
  add $/Systems/DB/Expences/Loader/AAA.txt
  add $/Systems/DB/Expences/Loader/BBB.txt
  add $/Systems/DB/Expences/Loader/CCC.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM

Comment:
  edited objects.

Items:
  edit $/Systems/DB/Expences/Loader
  edit $/Systems/DB/Expences/Loader/AAA.txt
  edit $/Systems/DB/Expences/Loader/AAB.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
  rename                $/Systems/DB/Expences/Loader/AAC.txt.

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------

To process this file I wrote the following code:

#Tags - used for splitting the information

tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'

#opening and reading the input file
#In path to input file use '\' as escape character
with open ("C:\\Users\\md_sarfaraz\\Desktop\\ctl_Files.txt", "r") as myfile:
    val=myfile.read().replace('\n', ' ')

#counting the occurrences of any one of the above tags
#as the count will be the same for all the tags
occurence = val.count(tag1)

#initializing row variable
row=""

#passing the count - occurence to the loop
for count in  range(1, occurence+1):
   row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
    + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
    + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
    + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
    + (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')

#opening and writing the output file
#In path to output file use '\' as escape character
file = open("C:\\Users\\md_sarfaraz\\Desktop\\processed_ctl_Files.txt", "w+") 
file.write(row)
file.close()

and got the following result (processed_ctl_Files.txt):

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   add $/Systems/DB/Expences/Loader/AAA.txt   add $/Systems/DB/Expences/Loader/BBB.txt   add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   edit $/Systems/DB/Expences/Loader/AAA.txt   edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   rename                $/Systems/DB/Rascal/Expences/AAC.txt.

But I want the result to look like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
                                                                          add $/Systems/DB/Expences/Loader/AAA.txt   
                                                                          add $/Systems/DB/Expences/Loader/BBB.txt   
                                                                          add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
                                                                 edit $/Systems/DB/Expences/Loader/AAA.txt   
                                                                 edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   
                                                                            rename                $/Systems/DB/Rascal/Expences/AAC.txt.

Or it would be great to get results like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename                $/Systems/DB/Rascal/Expences/AAC.txt.

Let me know how I can do this. Also, I'm very new to Python, so please excuse any lousy or redundant code, and help me improve it.

You removed all the newlines, even though you want them in your output? Just don't do that, and parse the input in a way that doesn't require it.
– abarnert, Commented May 20, 2015 at 9:30

Spacy · Accepted Answer · 2015-05-20 11:30:10Z

This solution is not as short and probably not as efficient as the answer that uses regular expressions, but it should be quite easy to understand. It also makes the parsed data easier to use, because each section's data is stored in a dictionary.

    ctl_file = "ctl_Files.txt" # path of source file
    processed_ctl_file = "processed_ctl_Files.txt" # path of destination file

    #Tags - used for splitting the information
    changeset_tag = 'Changeset:'
    user_tag = 'User:'
    date_tag = 'Date:'
    comment_tag = 'Comment:'
    items_tag = 'Items:'
    checkin_tag = 'Check-in Notes:'

    section_separator = "------------------------"
    changesets = []

    #open and read the input file
    with open(ctl_file, 'r') as read_file:
        first_section = True
        changeset_dict = {}
        items = []
        comment_stage = False
        items_stage = False
        checkin_dict = {}
        # Read one line at a time
        for line in read_file:
            # Check which tag matches the current line and store the data to matching key in the dictionary
            if changeset_tag in line:
                # split only on the first ':' so values containing ':' stay intact
                changeset = line.split(":", 1)[1].strip()
                changeset_dict[changeset_tag] = changeset
            elif user_tag in line:
                user = line.split(":", 1)[1].strip()
                changeset_dict[user_tag] = user
            elif date_tag in line:
                date = line.split(":", 1)[1].strip()  # the time part contains ':'
                changeset_dict[date_tag] = date
            elif comment_tag in line:
                comment_stage = True
            elif items_tag in line:
                items_stage = True
            elif checkin_tag in line:
                pass                        # not implemented due to example file not containing any data
            elif section_separator in line: # new section
                if first_section:
                    first_section = False
                    continue
                changesets.append(changeset_dict)
                changeset_dict = {}
                items = []
                # Set stages to false just in case
                items_stage = False
                comment_stage = False
            elif not line.strip():  # empty line
                if items_stage:
                    changeset_dict[items_tag] = items
                    items_stage = False
                comment_stage = False
            else:
                if comment_stage:
                    changeset_dict[comment_tag] = line.strip()  # Only works for one line comment  
                elif items_stage:
                    items.append(line.strip())

    #open and write to the output file
    with open(processed_ctl_file, 'w') as write_file:
        for changeset in changesets:        
            row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
            distance = len(row)
            items = changeset[items_tag]
            join_string = "\n" + distance * " "
            items_part = join_string.join(items)
            row += items_part + "\n"
            write_file.write(row)

Also, try to use variable names that describe their content. Names like tag1, tag2, etc. do not say much about what the variables hold. This makes code difficult to read, especially as scripts get longer. Readability might seem unimportant at first, but when revisiting old code it takes much longer to understand what the code does if the variable names are not descriptive.
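Because each changeset ends up as a dictionary, the second layout asked for in the question (the full prefix repeated on every item line) is a short loop. The following is only a sketch: `rows_per_item` and `sample_changesets` are hypothetical names, and the sample data is hand-written to mimic the shape of the entries in `changesets`, with the tag strings as keys.

```python
# Hypothetical sample shaped like the entries of `changesets` built above.
sample_changesets = [
    {
        'Changeset:': '145',
        'User:': 'Sarfaraz',
        'Date:': 'Thursday, April 07, 2011 5:34:54 PM',
        'Comment:': 'edited objects.',
        'Items:': [
            'edit $/Systems/DB/Expences/Loader',
            'edit $/Systems/DB/Expences/Loader/AAA.txt',
        ],
    },
]


def rows_per_item(changesets):
    """Return one 'changeset|user|date|comment|item' row per item."""
    rows = []
    for changeset in changesets:
        prefix = "|".join([
            changeset['Changeset:'],
            changeset['User:'],
            changeset['Date:'],
            changeset['Comment:'],
        ])
        # .get() guards against a changeset that has no "Items:" entries
        for item in changeset.get('Items:', []):
            rows.append(prefix + "|" + item)
    return rows


for row in rows_per_item(sample_changesets):
    print(row)
```

Writing the rows to a file instead of printing them would only require replacing the final loop with calls to `write_file.write(row + "\n")`.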

Roman Kutlak · Accepted Answer · 2015-05-20 10:13:31Z

I would start by extracting the values into variables. Then create a prefix from the first few tags. You can count the number of characters in the prefix and use that for the padding. When you get to items, append the first one to the prefix and any other item can be appended to padding created from the number of spaces that you need.

# keywords used in the "Items:" section
keywords = ['add', 'delete', 'edit', 'source', 'rename']

# val and tag1..tag6 come from the code in the question
row = ""

# process each changeset; the text before the first tag1 is skipped
for cs in val.split(tag1)[1:]:
    changeset = cs.split(tag2)[0].strip()
    user = cs.split(tag2)[1].split(tag3)[0].strip()
    date = cs.split(tag3)[1].split(tag4)[0].strip()
    comment = cs.split(tag4)[1].split(tag5)[0].strip()
    items = cs.split(tag5)[1].split(tag6)[0].strip().split()
    prefix = '{0}|{1}|{2}|{3}|'.format(changeset, user, date, comment)
    padding = ' ' * len(prefix)
    i = 0
    while i < len(items):
        # the first item follows the prefix, the others follow the padding
        pref = prefix if i == 0 else padding
        # collect the keywords (e.g. "delete, source rename") before the path
        j = i
        while j < len(items) and items[j].rstrip(',') in keywords:
            j += 1
        # items[i:j] are keywords; items[j], if present, is the path
        row += pref + ' '.join(items[i:j + 1]) + '\n'
        i = j + 1  # skip the keywords and the path

This seems to do what you want, but I am not sure if this is the best solution. Maybe it is better to process the file line by line and print the values straight to the stream?
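To illustrate the line-by-line idea, here is a minimal sketch that keeps the newlines and writes each row to a stream as soon as an item is read. The function name `write_rows`, the constants, and the shortened sample text are assumptions made for this example. It also assumes that the tags start their lines and that comment text never begins with one of the header tags; a changeset without items produces no output.

```python
import io
import sys

SEPARATOR = "------------------------"
HEADER_TAGS = ("Changeset:", "User:", "Date:")


def write_rows(lines, out):
    """Read changeset text line by line and write aligned rows to `out`."""
    header = {}
    comment = []
    section = None      # multi-line section currently being read, if any
    prefix = None       # built once "Items:" is reached
    first_item = True
    for raw in lines:
        line = raw.strip()
        if line.startswith(SEPARATOR):
            # a separator starts a new changeset: reset all state
            header, comment, section, prefix = {}, [], None, None
        elif line.startswith(HEADER_TAGS):
            tag, value = line.split(":", 1)   # split once: dates contain ':'
            header[tag] = value.strip()
        elif line == "Comment:":
            section = "comment"
        elif line == "Items:":
            section = "items"
            fields = [header.get("Changeset", ""), header.get("User", ""),
                      header.get("Date", ""), " ".join(comment)]
            prefix = "|".join(fields) + "|"
            first_item = True
        elif line == "Check-in Notes:":
            section = None                    # notes are ignored in this sketch
        elif line and section == "comment":
            comment.append(line)
        elif line and section == "items":
            lead = prefix if first_item else " " * len(prefix)
            out.write(lead + line + "\n")
            first_item = False


sample = """------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM

Comment:
  edited objects.

Items:
  edit $/Systems/DB/Expences/Loader
  edit $/Systems/DB/Expences/Loader/AAA.txt

Check-in Notes:
  Code Reviewer:
------------------------
"""

write_rows(io.StringIO(sample), sys.stdout)
```

For a real file, an open file object can be passed in place of `io.StringIO(sample)`, and an output file opened for writing in place of `sys.stdout`.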

Rob · Accepted Answer · 2015-05-20 10:14:03Z

You can use a regular expression to search for 'add', 'edit' etc.

import re 

#Tags - used for splitting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'

#opening and reading the input file
#In path to input file use '\' as escape character
with open ("wibble.txt", "r") as myfile:
    val=myfile.read().replace('\n', ' ') 

#counting the occurrences of any one of the above tags
#as the count will be the same for all the tags
occurence = val.count(tag1)

#initializing row variable
row=""

prevlen = 0

#passing the count - occurence to the loop
for count in  range(1, occurence+1):
   row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
    + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
    + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
    + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' )

   distance = len(row) - prevlen
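   # (edit|add|delete|rename) is an alternation of whole words; strip() keeps the first item on the prefix line
   row += re.sub(r"\s\s+(edit|add|delete|rename)", r"\n" + " " * distance + r"\1", val.split(tag5)[count].split(tag6)[0].strip()) + '\n'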
   prevlen = len(row)

#opening and writing the output file
#In path to output file use '\' as escape character
with open("wobble.txt", "w") as outfile:
    outfile.write(row)
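One detail worth checking when writing such a pattern: a character class like `[add]` matches a single character ('a' or 'd'), whereas an alternation like `(add|edit)` matches a whole word. The snippet below is a small sketch of the alternation form on a made-up one-line items string; `break_items`, `item_start`, and `items_text` are hypothetical names used only for illustration.

```python
import re

# Hypothetical "Items:" text after the newlines were replaced by spaces
items_text = "add $/Systems/DB/Expences/Loader   add $/Systems/DB/Expences/Loader/AAA.txt"

# Two or more whitespace characters followed by one whole keyword.
# [add] would instead be a character class matching a single 'a' or 'd'.
item_start = re.compile(r"\s\s+(add|edit|delete|rename)\b")


def break_items(text, indent):
    """Put each item on its own line, indenting the continuation lines."""
    # A function as replacement avoids escape handling in a template string.
    return item_start.sub(
        lambda m: "\n" + " " * indent + m.group(1),
        text.strip(),
    )


print(break_items(items_text, 4))
```

The `indent` argument plays the role of `distance` in the code above, so that continuation lines line up under the first item.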
