Splitting a CSV file into multiple CSV files by the values of a target column

Question

I'm fairly new to programming and Python in general. I have a big CSV file that I need to split into multiple CSV files based on the values in the target column (the last column).

Here's a simplified version of the CSV file data that I want to split.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

I want to split the data into separate CSV files like this:

sample1.csv

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1

sample2.csv

8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0

sample3.csv

4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1

sample4.csv

7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

I tried pandas and some groupby functions, but they merge all the rows into two files, one containing every row with target 1 and the other every row with target 0, which is not the output I need.

Any help would be appreciated.

What have you tried? Just iterate over the file and start writing to a new file every time the value in the last column changes.
– buran, Commented Feb 6, 2019 at 14:43

Novak · Accepted Answer · 2019-02-06 14:43:27Z

1

What you can do is read the value of the last column in each row. If it is the same as the value in the previous row, add the row to the current list; if it is not, create a new list and add the row to that. As the data structure, use a list of lists.
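A minimal sketch of this list-of-lists idea, assuming whitespace-separated rows with the target in the last column. The function name `group_consecutive` and the shortened sample rows are hypothetical, chosen only for illustration.

```python
def group_consecutive(lines):
    """Group lines into a list of lists, starting a new inner list
    whenever the last whitespace-separated field changes."""
    groups = []
    previous = None  # target of the previous row; None before the first row
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        target = fields[-1]
        if not groups or target != previous:
            groups.append([])  # target changed: open a new group
        groups[-1].append(line)
        previous = target
    return groups


# Hypothetical sample rows, taken from the question's data
rows = [
    "1254.00   1364.00   4562.33   4595.32   1",
    "1235.45   1765.22   4563.45   4862.54   1",
    "8623.75   5632.09   4586.25   9361.86   0",
    "4965.25   1983.78   4326.50   7901.10   1",
]
groups = group_consecutive(rows)
print([len(g) for g in groups])  # prints [2, 1, 1]
```

Each inner list could then be written to its own file, for example sample1.csv for the first group, sample2.csv for the second, and so on.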

answered Feb 6, 2019 at 14:43

Novak


balderman · Accepted Answer · 2019-02-06 17:09:00Z

0

Assume the file 'input.csv' contains the original data.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

The code below starts a new output file whenever the value in the last column changes:

target = None  # target value of the group being collected
counter = 0    # number of output files written so far
tmp = []       # lines belonging to the current group

with open('input.csv', 'r') as file_in:
    for line in file_in:
        if not line.strip():
            continue  # skip blank lines
        _target = line.split()[-1]  # last whitespace-separated field
        if tmp and _target != target:
            # the target changed: write out the finished group
            counter += 1
            with open('sample{}.csv'.format(counter), 'w') as file_out:
                file_out.writelines(tmp)
            tmp = []
        tmp.append(line)
        target = _target

# write the final group, which is not followed by a change of target
if tmp:
    counter += 1
    with open('sample{}.csv'.format(counter), 'w') as file_out:
        file_out.writelines(tmp)

edited Feb 6, 2019 at 17:09

answered Feb 6, 2019 at 14:46

balderman


7 Comments

Mishkat Rahman Over a year ago

Your output is similar to what my wrong output was. I didn't want my outputs to be in only 2 files one containing 0 and other 1. You can check the desired output I have given in the question. Still thanks!

balderman Over a year ago

You asked to create output files based on the 'target', which is the last column. This is what the code does. Please explain what should make the code create files like sample3.csv, etc. when there are only two targets [0, 1] in the data source you have provided.

Mishkat Rahman Over a year ago

Thank you for your effort. The iteration I wanted was, in the data, we see the first 3 rows have target value 1. So, sample1.csv file should contain the first 3 rows. When the target value changes from 1 to 0, it should create a new sample2.csv with the next two rows that contains target values 0. Then when the iterator finds the target value changed from 0 to 1 (in the 6th row), it should create a new sample3.csv and put the next rows with target values 1 and so on. I hope it clarified. Please take a look at my original question. There I've explained what I wanted for the output. Thanks!

balderman Over a year ago

OK, got it. I've modified the code; have a look.

Mishkat Rahman Over a year ago

Thanks. I tried your modified version, but I'm getting 12 sample CSV files instead of 4, because each row is now written to its own file instead of being grouped by target value.

Your output:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
sample2.csv
1235.45 1765.22 4563.45 4862.54 1

What I wanted:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1


gmds · Accepted Answer · 2019-02-06 22:48:31Z

0

Perhaps you want something like this:

from itertools import groupby
from operator import itemgetter

sep = '   '  # the sample data uses three spaces between fields

with open('data.csv') as f:
    data = f.read()

split_data = [row.split(sep) for row in data.splitlines() if row.strip()]  # skip blank lines
gb = groupby(split_data, key=itemgetter(4))  # consecutive runs of the 5th column

for index, (key, group) in enumerate(gb, start=1):  # number files from sample1.csv
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(row) for row in group)
        f.write(write_data)

Unlike pandas' groupby, which collects all rows with the same key into one group regardless of their position, itertools.groupby only groups consecutive items with equal keys and does not sort the source beforehand. The code parses the input into a list of lists and groups the outer list by the 5th column, which contains the target. The groupby object is an iterator over the groups; writing each group to a different file gives the result you want.
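To illustrate how `itertools.groupby` treats unsorted input, here is a small sketch with made-up target values (the list `targets` is hypothetical and not taken from any file):

```python
from itertools import groupby

# Hypothetical sequence of target values, in file order
targets = ['1', '1', '1', '0', '0', '1', '1', '0']

# Grouping the unsorted sequence yields one group per consecutive run.
runs = [(key, len(list(group))) for key, group in groupby(targets)]
print(runs)  # prints [('1', 3), ('0', 2), ('1', 2), ('0', 1)]

# Sorting first merges all equal keys, which gives the "two files" result.
merged = [(key, len(list(group))) for key, group in groupby(sorted(targets))]
print(merged)  # prints [('0', 3), ('1', 5)]
```

The first result matches the four-file layout asked for in the question; the second corresponds to the unwanted output with one file per distinct target value.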

edited Feb 6, 2019 at 22:48

answered Feb 6, 2019 at 14:47

gmds


3 Comments

Mishkat Rahman Over a year ago

Thanks. But I'm getting an index error for your line 10. IndexError: list index out of range

gmds Over a year ago

@MishkatRahman the problem is probably that the code I gave assumes that the source file is formatted exactly as you stated in the question (with 3 spaces between elements.) If it is truly a CSV as you state, you would need to change the sep value to something else.

Mishkat Rahman Over a year ago

Okay. Thanks Marcus for all the help!

gboffi · Accepted Answer · 2019-02-14 12:12:13Z

0

I propose to use a function to do what was asked for.

One could leave the file objects opened for writing unreferenced, so that they are closed automatically when garbage collected, but here I prefer to close every output file explicitly before opening the next one.

The script is heavily commented, so no further explanation is needed:

def split_data(data_fname, key_len=1, basename='file%03d.txt'):

    current_output = None # because we have not yet opened an output file
    prev_key = int(1)     # because a string is always different from an int
    count = 0             # because we want to count the output files

    with open(data_fname) as data:
        for line in data:

            # line usually has a trailing newline, so strip it
            # before taking the last key_len characters as the key
            key = line.rstrip('\n')[-key_len:]

            if key != prev_key:      # key has changed!

                count += 1           # a new file is going to be opened
                prev_key = key       # remember the new key
                if current_output:   # if a file was opened, close it
                    current_output.close()
                # open a new output file, its name derived from the variable count
                current_output = open(basename % count, 'w')

            # now we can write to the output file
            current_output.write(line)
            # note that line is already newline terminated

    # clean up what is still going (nothing if the input was empty)
    if current_output:
        current_output.close()

This answer has a history.

edited Feb 14, 2019 at 12:12

answered Feb 6, 2019 at 15:12

gboffi


2 Comments

Mishkat Rahman Over a year ago

Thanks gboffi for all the explanations. May I ask, in your modified version, what should I do with f.write(line)? Because we don't have any reference f. Where in your previous code, f = None.

gboffi Over a year ago

I forgot a name conversion when refactoring... Ouch! — Of course it should be current_output.write(line) because that's where we want to write the line we are processing. I've edited the answer, my apologies for the mistake and the ensuing confusion.
