I'm in a bit of a pickle and would appreciate the help. I'm trying to validate different CSV files that have different header structures. For instance, type1.csv has the following content:

COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4

Reading it with header=None

df = pd.read_csv('type1.csv', sep='|', header=None)

renders

#     0     1     2     3
0  COL1  COL2  COL3  COL4
1    A1    A2    A3    A4
2    B1    B2    B3    B4
3    C1    C2    C3    C4
4    D1    D2    D3    D4

which is fine, as I can replace the labels on the column axis with the values from row 0 (COL1, COL2, etc.)

header = df.columns.values

However if I have another file type2.csv that has the following structure

Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4

then header=None gives me this dataframe

#                          0         1       2                  3
0                   Datetime  timezone  source  unique identifier
1  Non Header Row Count = 4        NaN     NaN                NaN
2                       COL1      COL2    COL3               COL4
3                         A1        A2      A3                 A4
4                         B1        B2      B3                 B4
5                         C1        C2      C3                 C4
6                         D1        D2      D3                 D4

The approach I'd like to implement is to read the file into a dataframe with header=None, then iterate through it to find the row that contains the values COL1, COL2, COL3, etc., and split the dataframe at that index (perhaps using head(n), where n is the row that contains COL1, COL2, etc.), regardless of what is above that row. I plan to split the rows above it into a new dataframe to run some analysis on the content.

example split

#                          0         1       2                  3
0                   Datetime  timezone  source  unique identifier
1  Non Header Row Count = 4        NaN     NaN                NaN
#     0     1     2     3
0  COL1  COL2  COL3  COL4
1    A1    A2    A3    A4
2    B1    B2    B3    B4
3    C1    C2    C3    C4
4    D1    D2    D3    D4

Would this be achievable using isin(), or a combination of isin() with a regex or query()? I've searched for similar examples and questions but couldn't get it to work cleanly (and I'm still getting to grips with the pandas documentation).

I'd like to avoid skiprows, as I do want to keep the data above the COL1, COL2, COL3 row for data sanity checks. A pre-validation step that reads the file to determine the position of the header row, and then reads it into a dataframe using skiprows, therefore wouldn't be the optimal approach here.

Any help is appreciated. Apologies if the question isn't clear or if I'm making dumb assumptions or have a bad approach. Any criticism, feedback or advice is welcome (constructive or otherwise :))

1 Answer

You can use:

import pandas as pd
import io

temp=u"""Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4"""
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp), sep="|")
print(df1)
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN
1                       COL1     COL2   COL3              COL4
2                         A1       A2     A3                A4
3                         B1       B2     B3                B4
4                         C1       C2     C3                C4
5                         D1       D2     D3                D4

df2 = df1[2:]
df2.columns = df1.loc[1,:]
df2 = df2.reset_index(drop=True).rename_axis(None, axis=1)
print(df2)
  COL1 COL2 COL3 COL4
0   A1   A2   A3   A4
1   B1   B2   B3   B4
2   C1   C2   C3   C4
3   D1   D2   D3   D4

print(df1[:1])
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN

EDIT:

You can then find the index of the row where the first column contains COL1, using str.contains with boolean indexing:

col = df1[df1.iloc[:,0].str.contains('COL1')].index.tolist()[0]
print(col)
1

df2 = df1[col+1:]
df2.columns = df1.loc[col,:]
df2 = df2.reset_index(drop=True).rename_axis(None, axis=1)
print(df2)
  COL1 COL2 COL3 COL4
0   A1   A2   A3   A4
1   B1   B2   B3   B4
2   C1   C2   C3   C4
3   D1   D2   D3   D4

print(df1[:col])
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN
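
The question also asks about isin() and about keeping every line above the header row. A minimal sketch of that variant follows, assuming the file is read with header=None (so the first line stays in the data) and that the expected header names are known in advance. The helper name split_at_header and its expected parameter are made up for illustration and are not part of pandas.

```python
import io

import pandas as pd


def split_at_header(df, expected):
    """Split a header=None frame into (preamble, data).

    Hypothetical helper: the header row is taken to be the first row
    in which every cell is one of the names in `expected`.
    """
    # True for rows where every cell matches an expected header name
    is_header = df.isin(expected).all(axis=1)
    if not is_header.any():
        raise ValueError("no header row found")
    # Position of the first matching row
    pos = int(is_header.to_numpy().argmax())
    # Everything above the header row, kept for sanity checks
    preamble = df.iloc[:pos]
    # Everything below it, relabelled with the header row's values
    data = df.iloc[pos + 1:].reset_index(drop=True)
    data.columns = df.iloc[pos].tolist()
    return preamble, data


temp = """Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4"""

# after testing, replace io.StringIO(temp) with the filename
raw = pd.read_csv(io.StringIO(temp), sep="|", header=None)
preamble, data = split_at_header(raw, ["COL1", "COL2", "COL3", "COL4"])
print(preamble)
print(data)
```

For a file like type1.csv, where the header is the first line, the same call should return an empty preamble. Matching the whole row with isin() rather than one cell with str.contains also sidesteps missing values in the first column, at the cost of requiring the header names up front.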

1 Comment

This worked very well. I didn't think of approaching it like that. Thanks for the feedback and advice. I will try to expand upon this and provide additional feedback :)
