I'm fairly new to programming and to Python in general. I have a big CSV file that I need to split into multiple CSV files based on the values in the target column (the last column).

Here's a simplified version of the CSV file data that I want to split.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

I want to split the data into separate CSV files, like this:

sample1.csv

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1

sample2.csv

8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0

sample3.csv

4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1

sample4.csv

7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

I tried pandas with some groupby functions, but that merges all the 1s and all the 0s into two files, one containing every row with target 1 and the other every row with target 0, which is not the output I need.

Any help would be appreciated.

What have you tried? Just iterate over the file and start writing to a new file every time the value in the last column changes. Commented Feb 6, 2019 at 14:43

4 Answers

1

What you can do is get the value of the last column in each row. If it is the same as the value in the previous row, add the row to the current list; if it is not, create a new list and add the row to that one. As the data structure, use a list of lists.
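A minimal sketch of this list-of-lists idea, assuming the columns are separated by whitespace as in the question. The names `split_into_runs` and `write_groups` and the `sample{}.csv` pattern are only illustrative:

```python
def split_into_runs(lines):
    """Group consecutive lines that share the same last column."""
    groups = []          # list of lists, one inner list per run
    previous = None      # target value of the previous row
    for line in lines:
        if not line.strip():
            continue     # ignore blank lines
        target = line.split()[-1]
        if not groups or target != previous:
            groups.append([])      # target changed: start a new list
        groups[-1].append(line)
        previous = target
    return groups


def write_groups(groups, pattern='sample{}.csv'):
    """Write each group to its own file, numbered from 1."""
    for number, group in enumerate(groups, start=1):
        with open(pattern.format(number), 'w') as file_out:
            file_out.writelines(group)
```

It could be used by opening the input file and calling `write_groups(split_into_runs(file_in))`; for the data in the question this should give four files.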

0

Assume the file 'input.csv' contains the original data.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

The code is below:

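target = None   # target value of the group being collected
counter = 0     # number of output files written so far
tmp = []        # lines of the group being collected
with open('input.csv', 'r') as file_in:
    for line in file_in:
        if not line.strip():
            continue  # skip blank lines
        # split() with no argument handles any run of spaces or tabs
        _target = line.split()[-1]
        if tmp and _target != target:
            # the target changed: write the finished group to its own file
            counter += 1
            with open('sample{}.csv'.format(counter), 'w') as file_out:
                file_out.writelines(tmp)
            tmp = []
        tmp.append(line)
        target = _target

# write the final group, which no later change of target will flush
if tmp:
    counter += 1
    with open('sample{}.csv'.format(counter), 'w') as file_out:
        file_out.writelines(tmp)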

7 Comments

Your output is similar to what my wrong output was. I didn't want my output to be in only 2 files, one containing 0 and the other 1. You can check the desired output I have given in the question. Still, thanks!
You have asked to create output files based on the 'target', which is the last column. This is what the code does. Please explain what should make the code create files like sample3.csv, etc., when there are only two targets [0, 1] in the data source you have provided.
Thank you for your effort. The iteration I wanted is this: in the data, the first 3 rows have target value 1, so sample1.csv should contain the first 3 rows. When the target value changes from 1 to 0, it should create a new sample2.csv with the next two rows, which have target value 0. Then, when the iterator finds the target value changed from 0 to 1 (in the 6th row), it should create a new sample3.csv and put the next rows with target value 1 in it, and so on. I hope that clarifies it. Please take a look at my original question, where I've explained what I wanted for the output. Thanks!
OK, got it. The code was modified; have a look.
Thanks. I tried your modified one, but I'm getting 12 sample.csv files instead of 4 CSV files, because each row is now written to its own file instead of each group of target values. Your output: sample1.csv contains only 1254.00 1364.00 4562.33 4595.32 1; sample2.csv contains only 1235.45 1765.22 4563.45 4862.54 1. What I wanted: sample1.csv contains 1254.00 1364.00 4562.33 4595.32 1; 1235.45 1765.22 4563.45 4862.54 1; 6235.23 4563.00 7832.31 5320.36 1.
0

Perhaps you want something like this:

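from itertools import groupby
from operator import itemgetter

sep = '   '

with open('data.csv') as f:
    data = f.read()

# splitlines() leaves no empty last row; blank lines are skipped as well
split_data = [row.split(sep) for row in data.splitlines() if row.strip()]
# group consecutive rows by the 5th column, which holds the target
gb = groupby(split_data, key=itemgetter(4))

# start=1 so that the files are named sample1.csv, sample2.csv, ...
for index, (key, group) in enumerate(gb, start=1):
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(row) for row in group)
        f.write(write_data + '\n')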

Unlike the pandas groupby, itertools.groupby neither sorts the source nor collects all rows with the same key: it starts a new group every time the key changes, so only consecutive rows with equal keys end up together. The code parses the input CSV into a list of lists and performs a groupby on the outer list based on the 5th column, which contains the target. The groupby object is an iterator over the groups; writing each group to a different file gives the result you want.
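To see this consecutive-run behaviour in isolation, here is a small sketch that uses only the target column of the question's data (the list `targets` is typed in by hand for illustration):

```python
from itertools import groupby

# Target column of the 12 sample rows, in file order.
targets = ['1', '1', '1', '0', '0', '1', '1', '1', '1', '0', '0', '0']

# Each run of equal, adjacent values becomes its own group.
runs = [(key, len(list(group))) for key, group in groupby(targets)]
print(runs)
# [('1', 3), ('0', 2), ('1', 4), ('0', 3)]
```

The four runs correspond to the four files sample1.csv to sample4.csv asked for in the question.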

3 Comments

Thanks. But I'm getting an index error for your line 10. IndexError: list index out of range
@MishkatRahman the problem is probably that the code I gave assumes that the source file is formatted exactly as you stated in the question (with 3 spaces between elements.) If it is truly a CSV as you state, you would need to change the sep value to something else.
Okay. Thanks Marcus for all the help!
0

I propose to use a function to do what was asked for.

One could leave the file objects opened for writing unreferenced, so that they are closed automatically when garbage-collected, but here I prefer to close every output file explicitly before opening the next one.

The script is heavily commented, so I give no further explanation:

def split_data(data_fname, key_len=1, basename='file%03d.txt'):

    data = open(data_fname)

    current_output = None # because we have not yet opened an output file
    prev_key = int(1)     # because a string is always different from an int
    count = 0             # because we want to count the output files

    for line in data:

        # strip the trailing newline (if any) before extracting the key,
        # i.e. the last key_len characters of the line
        key = line.rstrip('\n')[-key_len:]

        if key != prev_key:      # key has changed!

            count += 1           # a new file is going to be opened
            prev_key = key       # remember the new key
            if current_output:   # if a file was opened, close it
                current_output.close()
            # open a new output file, its name derived from the variable count
            current_output = open(basename % count, 'w')

        # now we can write to the output file
        current_output.write(line)
        # note that line is already newline terminated

    # clean up what is still going
    data.close()
    if current_output:           # still None if the input file was empty
        current_output.close()

This answer has a history.

2 Comments

Thanks gboffi for all the explanations. May I ask, in your modified version, what should I do with f.write(line)? Because we don't have any reference f. Where in your previous code, f = None.
I forgot a name conversion when refactoring... Ouch! — Of course it should be current_output.write(line) because that's where we want to write the line we are processing. I've edited the answer, my apologies for the mistake and the ensuing confusion.
