1

I have a poorly structured CSV file named file.csv, and I want to split it into multiple CSV files using Python.

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

The new files need to be separated based on everything between the Family rows, so for example:

file1.csv

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|

file2.csv

|A|B|C|
|Continent||1|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

What's the best way of achieving this when the number of rows between appearances of Family is not consistent?

3 Answers

1

If your file really looks like that ;) then you could use groupby from the standard library module itertools:

from itertools import groupby

def key(line): return line.startswith("|Family|")

family_line, file_no = None, 0
with open("file.csv", "r") as fin:
    for is_family_line, lines in groupby(fin, key=key):
        if is_family_line:
            family_line = list(lines).pop()
        elif family_line is None:
            header = "".join(lines)
        else:
            file_no += 1
            with open(f"file{file_no}.csv", "w") as fout:
                fout.write(header + family_line)
                for line in lines:
                    fout.write(line)
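
To see why this works, here is a minimal sketch of how groupby cuts the input into runs of consecutive lines that share the same key value. The sample lines are a shortened copy of the question's data, and the name is_family is only illustrative:

```python
from itertools import groupby

# Shortened sample of the lines in the question's file.csv
lines = [
    "|A|B|C|\n",
    "|Continent||1|\n",
    "|Family|44950|file1|\n",
    "|Species|44950|12|\n",
    "|Habitat||4|\n",
    "|Family|Fish|file2|\n",
    "|Species|Bass|8|\n",
]

def is_family(line):
    return line.startswith("|Family|")

# Each run of consecutive lines with the same key value becomes one group
runs = [(flag, len(list(group))) for flag, group in groupby(lines, key=is_family)]
print(runs)
# [(False, 2), (True, 1), (False, 2), (True, 1), (False, 1)]
```

The first False run is the header, and every True run (a Family line) is followed by the False run holding the rest of that block.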

A Pandas solution would be:

import pandas as pd

df = pd.read_csv("file.csv", header=None, delimiter="|").fillna("")
blocks = df.iloc[:, 1].eq("Family").cumsum()
header_df = df[blocks.eq(0)]
for no, sdf in df.groupby(blocks):
    if no > 0:
        sdf = pd.concat([header_df, sdf])
        sdf.to_csv(f"file{no}.csv", index=False, header=False, sep="|")
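
The block labels come from a running count of Family rows. As a rough illustration of that idea only, the sketch below reproduces the labelling with itertools.accumulate from the standard library instead of Pandas; the list first_column is a hand-copied stand-in for df.iloc[:, 1]:

```python
from itertools import accumulate

# Values of column A from the question, top to bottom
first_column = [
    "A", "Continent", "Family", "Species", "Habitat", "Species",
    "Condition", "Family", "Species", "Species", "Habitat",
]

# The running sum goes up by one at every Family row, so all rows
# of the same block end up with the same label
blocks = list(accumulate(int(value == "Family") for value in first_column))
print(blocks)
# [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```

Label 0 marks the header rows, which is why the Pandas version selects blocks.eq(0) as header_df and skips no == 0 in the loop.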

5 Comments

Thanks, the pandas solution works well. What if I wanted to keep the header columns persistent throughout each file though?
@MSD Thanks for the feedback. I've adjusted both solutions to take that into account (seems to work here). In the Pandas version you could use df["A"] instead of df.iloc[:, 1] - but it seemed to me that A might not be the real column label, so I made it a bit more generic.
@MSD Ah, sorry, I've just realized that you want the first 2 rows as header, not just the first - is that correct?
Yeah that's correct. Sorry I don't think I called that out in my original post (and yes, it's a very gnarly CSV).
@MSD Okay, thanks: I've updated both versions and think they work.
0
import pandas as pd
df = pd.read_csv('file.csv', delimiter='|')
groups = df.groupby('Family')
for name, group in groups:
    group.to_csv(name + '.csv', index=False)

1 Comment

That doesn't work: df has no column Family, and even if it had one the results would be completely different from what the question asks.
0

Here is a working pure-Python method:

# Read the whole file into memory
with open('file.csv', 'r') as file:
    text = file.read()

# Split on the |Family| marker; the first part holds the header rows
header, *parts = text.split("|Family|")

# Write one file per block, restoring the header and the |Family| marker
for i, content in enumerate(parts, start=1):
    with open('file{}.csv'.format(i), 'w') as file:
        file.write(header + '|Family|' + content)
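
In the question's sample, column C of each Family row happens to hold the target file name (file1, file2). If that holds for the real data, which is an assumption, a streaming variant could name the output files from that field instead of a counter. The function name split_by_family and its parameters are hypothetical:

```python
from pathlib import Path

def split_by_family(source, out_dir):
    """Write one CSV per Family block, named after the third field of its Family row."""
    out_dir = Path(out_dir)
    header, current, written = [], None, []
    try:
        with open(source, "r") as fin:
            for line in fin:
                if line.startswith("|Family|"):
                    if current is not None:
                        current.close()
                    # "|Family|44950|file1|" splits into ["", "Family", "44950", "file1", ""]
                    name = line.rstrip("\n").split("|")[3]
                    path = out_dir / f"{name}.csv"
                    current = open(path, "w")
                    current.writelines(header)  # repeat the header rows in every file
                    written.append(path)
                if current is None:
                    header.append(line)  # rows before the first Family row
                else:
                    current.write(line)
    finally:
        if current is not None:
            current.close()
    return written
```

Calling split_by_family("file.csv", ".") would write the files next to the input. Unlike the read-everything approach, this reads one line at a time, so it should also cope with large files.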
