CSV parsing in Python

Question

I want to parse a CSV file in the following format:

Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3

Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3

Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3

Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3

and would like to turn this into a tab-separated format like the following:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3

TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3


TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3

TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

The number of TestAttributes varies from test to test: some tests have only 3 values, others 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, so each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence the 3 value lines).

I am new to Python and do not have much experience, but I would like to parse the CSV file with Python. I looked at Python's csv library and could not tell whether it will be enough for me, or whether I should write my own string parser. Could you please help me?

Best

Did you actually try the csv module? Did it work? If not, what didn't work?
– user1907906, Commented Mar 17, 2014 at 14:40
Using csv.reader with the parameter delimiter set to "," will allow you to retrieve the content of the file as lists of strings. From there you'll need to reformat the whole structure.
– El Bert, Commented Mar 17, 2014 at 14:43
@LutzHorn Actually, I have not yet been able to look at the csv module in detail; I hope to have time in a few hours. As far as I understand, though, in my case it is only useful for separating the text at each ",". So what is the use of the csv module? I could do that by writing a simple text parser that checks whether "," exists. I am curious whether the csv module can do more for my case than finding "," and separating the values. I do not know if I am looking for magic :)
– Xentius, Commented Mar 17, 2014 at 14:55
CSV could also be named DSV: Delimiter Separated Values. The delimiter could also be whitespace. You should 1) find a way to split your input into blocks, and 2) parse these blocks as CSV.
– user1907906, Commented Mar 17, 2014 at 15:12
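
The two steps suggested in the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names split_blocks and parse_block and the shortened SAMPLE data are made up, and it assumes blocks are always separated by blank lines.

```python
import csv

# Shortened, hypothetical sample in the same layout as the question's file.
SAMPLE = """Test Environment INFO for 1 line.
Test,TestName1,
A1,A2,A3
V1,V2,V3

Test,TestName2,
B1,B2
W1,W2
"""

def split_blocks(text):
    """Step 1: split the text into blocks separated by blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def parse_block(lines):
    """Step 2: parse one block as CSV and return (test name, data rows)."""
    rows = list(csv.reader(lines))
    return rows[0][1], rows[1:]

# Drop the info line first, then handle every block independently.
body = SAMPLE.split("\n", 1)[1]
parsed = [parse_block(block) for block in split_blocks(body)]
print(parsed)
```

Each entry of parsed pairs a test name with its attribute row and value rows, which is the structure that still needs to be transposed for the desired output.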

Steinar Lima · Accepted Answer · 2014-03-20 04:05:58Z

I'd use a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!

I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:

import csv
from itertools import groupby

with open('my_data.csv', newline='') as ifile, open('my_out_data.csv', 'w', newline='') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])

Output, given that the supplied data is stored in my_data.csv:

TestName1
TestAttribute1-1    TestAttributeValue1-1
TestAttribute1-2    TestAttributeValue1-2
TestAttribute1-3    TestAttributeValue1-3

TestName2
TestAttribute2-1    TestAttributeValue2-1
TestAttribute2-2    TestAttributeValue2-2
TestAttribute2-3    TestAttributeValue2-3

TestName3
TestAttribute3-1    TestAttributeValue3-1
TestAttribute3-2    TestAttributeValue3-2
TestAttribute3-3    TestAttributeValue3-3

TestName4
TestAttribute4-1    TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2    TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3    TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
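
To see what the groupby step does in isolation, here is a small sketch on in-memory rows. The rows list is invented sample data shaped like what csv.reader yields for the question's file (a blank line becomes an empty list); it is not taken from the answer itself.

```python
from itertools import groupby

# Rows as csv.reader would yield them; a blank line becomes an empty list.
rows = [
    ['Test', 'TestName1', ''],
    ['A1', 'A2'],
    ['V1', 'V2'],
    [],
    ['Test', 'TestName2', ''],
    ['B1', 'B2'],
    ['W1', 'W2'],
    ['X1', 'X2'],
]

# groupby starts a new group whenever the key changes, so each run of
# consecutive non-empty rows is one group (key True) and each run of
# blank rows is another (key False), which is filtered out here.
datasets = [list(group) for is_data, group in groupby(rows, key=bool) if is_data]

for dataset in datasets:
    name = dataset[0][1]                  # second field of the "Test,..." line
    transposed = list(zip(*dataset[1:]))  # attribute row + value rows, transposed
    print(name, transposed)
```

Because groupby only looks at consecutive rows, it never needs more than the current dataset, which is why the approach stays lazy on large files.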

Pi Marillion · Answer · 2014-03-20 04:41:26Z

The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:

import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row
        max_len = max(len(col) for col in section)
        # build and write each transposed row, padding short rows with ''
        for i in range(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the file first and wrap it
with open(in_path, 'r', newline='') as f_raw, open(out_path, 'w') as f_out:
    f_in = csv.reader(f_raw)
    # skip the info line
    next(f_in)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        elif line:
            # add non-blank line to section (blank lines arrive as [])
            section.append(line)
    # print the last section
    print_section(section, f_out)

Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.

The basic idea here is that we read one section at a time into a list of lists, then write that list of lists back out using a list comprehension to transpose it (adding blank elements when the rows are uneven).

edited Mar 20, 2014 at 4:41

answered Mar 20, 2014 at 1:19

6 Comments

Steinar Lima Over a year ago

a) Use the csv module when dealing with delimited files, and b) to transpose a matrix, use zip(*iterable)

Pi Marillion Over a year ago

@SteinarLima a) Module used now. In this case, though, complexity was not reduced. b) zip(*iterable) silently drops data in uneven columns. In my experience, few users desire data to disappear in that manner.

Steinar Lima Over a year ago

b) izip_longest from itertools (named zip_longest in Python 3) can be used if you don't want that behavior.
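
The difference discussed in these two comments can be illustrated with a short sketch. The rows list below is invented sample data with a deliberately short second row, and the example uses the Python 3 name zip_longest.

```python
from itertools import zip_longest  # named izip_longest in Python 2

rows = [
    ['attr1', 'attr2', 'attr3'],
    ['val1', 'val2'],  # hypothetical short row: the third value is missing
]

# zip stops at the shortest row, so 'attr3' silently disappears.
truncated = list(zip(*rows))

# zip_longest pads the short row with the fill value instead.
padded = list(zip_longest(*rows, fillvalue=''))

print(truncated)
print(padded)
```

Which behavior is preferable depends on the data: padding keeps every attribute visible, while plain zip assumes all rows in a section have the same length.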

Pi Marillion Over a year ago

@SteinarLima Thanks! I forgot to check itertools. I may update the code above after work today.

Steinar Lima Over a year ago

The csv module is superior to split(',') in many ways - the most important is that it handles quotation. The line 1,"me, you and him",2 should be split into 3 parts, not 4 for instance.
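
As a quick check of the quoting point above, the following sketch compares a plain split(',') with csv.reader on the example line from the comment. Wrapping the line in a one-element list is just a convenient way to feed a single string to csv.reader.

```python
import csv

line = '1,"me, you and him",2'

# A plain split cuts inside the quoted field.
naive = line.split(',')

# csv.reader accepts any iterable of lines and respects the quotes.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields 4 pieces with stray quote characters, while csv.reader returns the 3 intended fields.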
