Split CSV into multiple files based on column value

Question

I have a poorly structured CSV file named file.csv, and I want to split it into multiple CSV files using Python.

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

The new files need to be separated based on everything between the Family rows, so for example:

file1.csv

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|

file2.csv

|A|B|C|
|Continent||1|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

What's the best way of achieving this when the number of rows between appearances of Family is not consistent?

Timus · Accepted Answer · 2023-01-25 16:03:09Z

1

If your file really looks like that ;) then you could use groupby from the standard library module itertools:

from itertools import groupby

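def key(line):
    # True for the rows that start a new block
    return line.startswith("|Family|")

header, family_line, file_no = "", None, 0
with open("file.csv", "r") as fin:
    # groupby yields alternating runs of Family lines and non-Family lines
    for is_family_line, lines in groupby(fin, key=key):
        if is_family_line:
            # Remember the Family line that opens the next block
            family_line = list(lines).pop()
        elif family_line is None:
            # Everything before the first Family line is the shared header
            header = "".join(lines)
        else:
            file_no += 1
            with open(f"file{file_no}.csv", "w") as fout:
                fout.write(header + family_line)
                fout.writelines(lines)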
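
To see why the loop above alternates between header, Family line, and block, here is a small sketch of what groupby yields with a boolean key. The sample list and the name is_family are made up for illustration; the lines only mimic the question's data.

```python
from itertools import groupby

# Hypothetical sample lines shaped like the question's file
sample = [
    "|A|B|C|\n",
    "|Continent||1|\n",
    "|Family|44950|file1|\n",
    "|Species|44950|12|\n",
    "|Family|Fish|file2|\n",
    "|Species|Bass|8|\n",
    "|Habitat|River|3|\n",
]

def is_family(line):
    return line.startswith("|Family|")

# Each run is a maximal stretch of consecutive lines sharing the same key
runs = [(flag, len(list(group))) for flag, group in groupby(sample, key=is_family)]
print(runs)  # [(False, 2), (True, 1), (False, 1), (True, 1), (False, 2)]
```

The first False run is the header, and every later False run is the body of the block opened by the Family line just before it.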

A Pandas solution would be:

import pandas as pd

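# No header row; the leading and trailing "|" produce empty first and last columns
df = pd.read_csv("file.csv", header=None, delimiter="|").fillna("")
# Running count of Family rows: 0 for the header rows, then 1, 2, ... per block
blocks = df.iloc[:, 1].eq("Family").cumsum()
header_df = df[blocks.eq(0)]
for no, sdf in df.groupby(blocks):
    if no > 0:
        # Prepend the shared header rows to every block
        sdf = pd.concat([header_df, sdf])
        sdf.to_csv(f"file{no}.csv", index=False, header=False, sep="|")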
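
The cumsum step may be easier to follow on a toy Series. This is only a sketch: first_col is a hypothetical stand-in for df.iloc[:, 1], with values borrowed from the question.

```python
import pandas as pd

# Hypothetical stand-in for the label column of the question's file
first_col = pd.Series(
    ["A", "Continent", "Family", "Species", "Habitat", "Family", "Species"]
)

# eq() marks the Family rows; cumsum() turns those marks into block numbers
blocks = first_col.eq("Family").cumsum()
print(blocks.tolist())  # [0, 0, 1, 1, 1, 2, 2]
```

Rows labelled 0 come before the first Family row (the shared header), and each later label identifies one output file.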

edited Jan 25, 2023 at 16:03

answered Jan 25, 2023 at 10:05

Timus


5 Comments

user53526356 Over a year ago

Thanks, the pandas solution works well. What if I wanted to keep the header columns persistent throughout each file though?

Timus Over a year ago

@MSD Thanks for the feedback. I've adjusted both solutions to take that into account (seems to work here). In the Pandas version you could use df["A"] instead of df.iloc[:, 1] - but it seemed to me that A might not be the real column label, so I made it a bit more generic.

Timus Over a year ago

@MSD Ah, sorry, I've just realized that you want the first 2 rows as header, not just the first - is that correct?

user53526356 Over a year ago

Yeah that's correct. Sorry I don't think I called that out in my original post (and yes, it's a very gnarly CSV).

Timus Over a year ago

@MSD Okay, thanks: I've updated both versions and think they work.
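
On the df.iloc[:, 1] remark in the comments above: a likely reason the labels sit at position 1 rather than 0 is that every line starts and ends with the delimiter, so splitting on | yields an empty first and last field. A minimal sketch with the standard csv module (not pandas itself, so treat the parallel as an assumption):

```python
import csv
import io

# One line copied from the question's data
line = "|Family|Fish|file2|\n"

# Split on "|" the same way a delimiter-based CSV reader would
row = next(csv.reader(io.StringIO(line), delimiter="|"))
print(row)  # ['', 'Family', 'Fish', 'file2', '']
```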

SyntaxNavigator · Answer · 2023-01-25 01:53:29Z

0

import pandas as pd

df = pd.read_csv('file.csv', delimiter='|')
groups = df.groupby('Family')
for name, group in groups:
    group.to_csv(name + '.csv', index=False)

edited Jan 25, 2023 at 1:53

answered Jan 25, 2023 at 1:34

SyntaxNavigator


1 Comment

Timus Over a year ago

That doesn't work: df has no column Family, and even if it had one the results would be completely different from what the question asks.
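
To illustrate the comment's second point in plain Python: grouping by the value of a field pools matching rows from the whole file, whereas the question needs contiguous blocks that start at each Family row. The rows list below is a made-up, shortened version of the question's data.

```python
from collections import defaultdict

# Hypothetical (label, value) pairs; the sample starts with a Family row
rows = [
    ("Family", "44950"), ("Species", "44950"), ("Habitat", ""),
    ("Family", "Fish"), ("Species", "Bass"), ("Species", "Trout"),
]

# Grouping by value: all Species rows land together, whatever their family
by_value = defaultdict(list)
for label, value in rows:
    by_value[label].append(value)

# Grouping by position: open a new block at every Family row
blocks = []
for label, value in rows:
    if label == "Family":
        blocks.append([])
    blocks[-1].append((label, value))

print(by_value["Species"])       # ['44950', 'Bass', 'Trout']
print([len(b) for b in blocks])  # [3, 3]
```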

farshad · Answer · 2023-01-25 02:27:10Z

0

Here is a working pure-Python method:

# Read the whole file
with open('file.csv', 'r') as file:
    text = file.read()

# Split on |Family|; the first part is everything before the first Family row
header, *parts = text.split("|Family|")

# Add |Family| back to each part
parts = ['|Family|' + part for part in parts]

# Write file1.csv, file2.csv, ..., each starting with the shared header rows
for i, content in enumerate(parts, start=1):
    with open('file{}.csv'.format(i), 'w') as file:
        file.write(header + content)

answered Jan 25, 2023 at 2:27

farshad
