I'm in a bit of a pickle and would appreciate the help. I'm trying to validate different CSV files that have different header structures. For instance, type1.csv has the following:
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4
Using header=None:
import pandas as pd
df = pd.read_csv('type1.csv', sep='|', header=None)
renders
# 0 1 2 3
0 COL1 COL2 COL3 COL4
1 A1 A2 A3 A4
2 B1 B2 B3 B4
3 C1 C2 C3 C4
4 D1 D2 D3 D4
which is fine, as I can replace the labels on the column axis with the values in the row at index 0 (COL1, COL2, etc.)
header = df.columns.values
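For this single-header case, promoting the row at index 0 to the column labels might look like the minimal sketch below. The inline string stands in for type1.csv, and `promote_first_row` is a hypothetical helper name, not a pandas function.

```python
import io
import pandas as pd

# Inline stand-in for type1.csv (assumed to match the sample shown above)
TYPE1 = (
    "COL1|COL2|COL3|COL4\n"
    "A1|A2|A3|A4\n"
    "B1|B2|B3|B4\n"
    "C1|C2|C3|C4\n"
    "D1|D2|D3|D4\n"
)

def promote_first_row(df):
    """Hypothetical helper: use row 0 as column labels and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out

raw = pd.read_csv(io.StringIO(TYPE1), sep="|", header=None)
data = promote_first_row(raw)
print(data)
```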
However, another file, type2.csv, has the following structure:
Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4
Reading it with header=None gives me this dataframe:
# 0 1 2 3
0 Datetime timezone source unique identifier
1 Non Header Row Count = 4 NaN NaN NaN
2 COL1 COL2 COL3 COL4
3 A1 A2 A3 A4
4 B1 B2 B3 B4
5 C1 C2 C3 C4
6 D1 D2 D3 D4
The approach I'd like to implement is to read the file into a dataframe with header=None, then iterate through it to find the row that contains the values COL1, COL2, COL3, etc. I would then split the dataframe at that index, perhaps using head(n), where n is the row that contains COL1, COL2, etc., regardless of what is above that row. (I plan to split the rows above it into a new dataframe to run some analysis on the content.)
example split
# 0 1 2 3
0 Datetime timezone source unique identifier
1 Non Header Row Count = 4 NaN NaN NaN
# 0 1 2 3
0 COL1 COL2 COL3 COL4
1 A1 A2 A3 A4
2 B1 B2 B3 B4
3 C1 C2 C3 C4
4 D1 D2 D3 D4
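One way the split might be done is sketched below, assuming the expected header names are known up front and every row has the same number of fields. The inline string stands in for type2.csv, and `EXPECTED` and `split_at_header` are hypothetical names introduced for illustration.

```python
import io
import pandas as pd

# Inline stand-in for type2.csv (assumed to match the sample shown above)
TYPE2 = (
    "Datetime|timezone|source|unique identifier\n"
    "Non Header Row Count = 4 |||\n"
    "COL1|COL2|COL3|COL4\n"
    "A1|A2|A3|A4\n"
    "B1|B2|B3|B4\n"
    "C1|C2|C3|C4\n"
    "D1|D2|D3|D4\n"
)

# Assumption: the header names to look for are known in advance
EXPECTED = ["COL1", "COL2", "COL3", "COL4"]

def split_at_header(df, expected):
    """Hypothetical helper: split df at the first row made up of expected names."""
    # True for rows where every cell is one of the expected header names
    is_header = df.isin(expected).all(axis=1)
    if not is_header.any():
        raise ValueError("no header row found")
    pos = int(is_header.to_numpy().argmax())  # position of the first matching row
    preamble = df.iloc[:pos]                  # everything above the header row
    body = df.iloc[pos:].reset_index(drop=True)  # header row onwards, renumbered
    return preamble, body

raw = pd.read_csv(io.StringIO(TYPE2), sep="|", header=None)
preamble, body = split_at_header(raw, EXPECTED)
print(preamble)
print(body)
```

Swapping `.all(axis=1)` for `.any(axis=1)` would match rows that contain at least one of the names, which is looser and may pick up non-header rows.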
Would this be achievable using isin(), or a combination of isin() with a regex or query()? I've searched for similar examples and questions but couldn't get it to work cleanly (and I'm still getting to grips with the pandas documentation).
I'd like to avoid skiprows, as I do want to keep the data above the COL1, COL2, COL3 row for data sanity checks. A pre-validation step that reads the file to determine the header row's position, and then reads it again as a dataframe using skiprows, wouldn't be the optimal approach here.
Any help is appreciated. Apologies if the question isn't clear or if I'm making dumb assumptions or have a bad approach. Any criticism, feedback or advice is welcome (constructive or otherwise :))