4

I want to parse a CSV file that has the following format:

Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3

Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3

Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3

Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3

and would like to turn it into a tab-separated format like the following:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3

TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3


TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3

TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

The number of TestAttributes varies from test to test: some tests have only 3 values, others 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, so each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence the 3 value lines).

I am new to Python and do not know much yet, but I would like to parse the CSV file with Python. I looked at Python's csv library but could not tell whether it will be enough for me or whether I should write my own string parser. Could you please help me?

Best

4 Comments
  • 5
    Did you actually try the csv module? Did it work? If not, what didn't work? Commented Mar 17, 2014 at 14:40
  • 1
    Using csv.reader with the parameter delimiter set to "," will allow you to retrieve the content of the file as lists of strings. From there you'll need to reformat the whole structure. Commented Mar 17, 2014 at 14:43
  • @LutzHorn Actually, I have not been able to look at the csv module in detail yet; I hope to have time in a few hours. However, as far as I understand, in my case it is only useful for separating the text at each ",". So what is the use of the csv module? I can do that by writing a simple text parser that checks whether "," exists or not. I am curious whether the csv module can do more for my case than finding "," and separating the values. I do not know if I am looking for magic :) Commented Mar 17, 2014 at 14:55
  • 1
    CSV could also be named DSV: Delimiter Separated Values. The delimiter could also be whitespace. You should 1) find a way to split your input in blocks, and 2) parse these blocks as CSV. Commented Mar 17, 2014 at 15:12
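
To make the comments above concrete, here is a small sketch of what csv.reader returns for data shaped like the question's sample. The in-memory string `sample` is an assumption made for illustration only; a real script would pass an open file object instead. Note that blank lines come back as empty lists, which is what makes them usable as block separators.

```python
import csv
import io

# Hypothetical in-memory sample mirroring the start of the question's file.
sample = (
    "Test Environment INFO for 1 line.\n"
    "Test,TestName1,\n"
    "TestAttribute1-1,TestAttribute1-2,TestAttribute1-3\n"
    "TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3\n"
    "\n"
    "Test,TestName2,\n"
)

# Each row is returned as a list of strings; a blank line becomes [].
rows = list(csv.reader(io.StringIO(sample), delimiter=","))

for row in rows:
    print(row)
```

The trailing comma on the "Test,TestName1," line produces an empty third field, so that row has three elements.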

2 Answers

2

I'd solve this with the itertools.groupby function and the csv module. Take a close look at the itertools documentation; it is useful more often than you might think!

I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:

import csv
from itertools import groupby

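# Python 3: open both files in text mode with newline='' so the csv module
# handles line endings itself
with open('my_data.csv', newline='') as ifile, open('my_out_data.csv', 'w', newline='') as ofile: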
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])

Output, given that the supplied data is stored in my_data.csv:

TestName1
TestAttribute1-1    TestAttributeValue1-1
TestAttribute1-2    TestAttributeValue1-2
TestAttribute1-3    TestAttributeValue1-3

TestName2
TestAttribute2-1    TestAttributeValue2-1
TestAttribute2-2    TestAttributeValue2-2
TestAttribute2-3    TestAttributeValue2-3

TestName3
TestAttribute3-1    TestAttributeValue3-1
TestAttribute3-2    TestAttributeValue3-2
TestAttribute3-3    TestAttributeValue3-3

TestName4
TestAttribute4-1    TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2    TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3    TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
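
The grouping step is easier to follow in isolation. The following is a minimal sketch, assuming the rows have already been parsed by csv.reader; the `rows` list and its short placeholder values are hypothetical and stand in for the real file contents.

```python
from itertools import groupby

# Hypothetical parsed rows as csv.reader would yield them; [] marks a blank line.
rows = [
    ["Test", "TestName1", ""],
    ["A1", "A2"],
    ["V1", "V2"],
    [],
    ["Test", "TestName2", ""],
    ["B1", "B2"],
    ["V3", "V4"],
    ["V5", "V6"],
]

# groupby yields runs of consecutive rows that share the same key. With
# key=bool, non-empty rows give True and blank rows give False, so keeping
# only the True runs leaves one list of rows per dataset.
blocks = [list(group) for is_data, group in groupby(rows, key=bool) if is_data]

for block in blocks:
    name = block[0][1]                  # second field of the "Test,<name>," row
    transposed = list(zip(*block[1:]))  # attribute row plus value rows, transposed
    print(name, transposed)
```

Because groupby only compares neighbouring items, this relies on the datasets being separated by at least one blank line, as in the question's file.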

2

The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:

import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row
        max_len = max(len(row) for row in section)
        # transpose: one output line per input column, padding short rows with ''
        for i in range(max_len):
            f_out.write('\t'.join([row[i] if len(row) > i else '' for row in section]) + '\n')
        f_out.write('\n')

# open both files, then wrap the input file in a csv reader
with open(in_path, 'r', newline='') as f_raw, open(out_path, 'w') as f_out:
    f_in = csv.reader(f_raw)
    # skip the info line
    next(f_in)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        elif line:
            # add non-blank line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)

Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.

The basic idea here is that we read each section into a list of lists, then write that list of lists back out using a list comprehension to transpose it (padding with blank elements when the rows are uneven).
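
The padded transposition described above can also be written with itertools.zip_longest (named izip_longest in Python 2). The sketch below contrasts it with plain zip; the `section` data is hypothetical and deliberately has one short row.

```python
from itertools import zip_longest

# Hypothetical section with uneven rows: the second value row lacks a field.
section = [
    ["Attr1", "Attr2", "Attr3"],
    ["Val1-1", "Val1-2", "Val1-3"],
    ["Val2-1", "Val2-2"],
]

# zip stops at the shortest row, so the third attribute is dropped entirely.
truncated = list(zip(*section))

# zip_longest pads short rows with fillvalue, so every attribute is kept.
padded = list(zip_longest(*section, fillvalue=""))

for row in padded:
    print("\t".join(row))
```

Which behaviour is right depends on the data: padding keeps everything, while plain zip is only safe when every row is known to have the same length.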

6 Comments

a) Use the csv module when dealing with delimited files, and b) to transpose a matrix, use zip(*iterable)
@SteinarLima a) Module used now. In this case, though, complexity was not reduced. b) zip(*iterable) silently drops data in uneven columns. In my experience, few users desire data to disappear in that manner.
b) izip_longest from itertools can be used if you don't want that behavior.
@SteinarLima Thanks! I forgot to check itertools. I may update the code above after work today.
The csv module is superior to split(',') in many ways - the most important is that it handles quotation. The line 1,"me, you and him",2 should be split into 3 parts, not 4 for instance.
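
The quoting point from the last comment can be checked directly. This is a small sketch using the example line from that comment; the variable names are illustrative.

```python
import csv

line = '1,"me, you and him",2'

# Naive splitting cuts inside the quoted field and keeps the quote characters.
naive = line.split(",")

# csv.reader accepts any iterable of strings and honours the quotes.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

Here split(',') yields four pieces while csv.reader yields the intended three fields.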