Determine the header in a csv file using pandas if header=None

Question

I'm in a bit of a pickle and would appreciate some help. I'm trying to validate different csv files that have different header structures. For instance, type1.csv has the following content:

COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4

Using header=None

df = pd.read_csv('type1.csv', sep='|', header=None)

renders

#     0     1     2     3
0  COL1  COL2  COL3  COL4
1    A1    A2    A3    A4
2    B1    B2    B3    B4
3    C1    C2    C3    C4
4    D1    D2    D3    D4

which is fine, as I can replace the labels on the column axis with the values at index 0 (COL1, COL2, etc.)

header = df.columns.values

However, suppose I have another file, type2.csv, with the following structure

Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4

with header=None giving me this dataframe

#                          0         1       2                  3
0                   Datetime  timezone  source  unique identifier
1  Non Header Row Count = 4        NaN     NaN                NaN
2                       COL1      COL2    COL3               COL4
3                         A1        A2      A3                 A4
4                         B1        B2      B3                 B4
5                         C1        C2      C3                 C4
6                         D1        D2      D3                 D4

The approach I'd like to implement is to read the file into a dataframe with header=None, then iterate through it to find the row that contains the values COL1, COL2, COL3, etc. I would then split the dataframe at that index, perhaps using head(n), where n is the row that contains COL1, COL2, etc., regardless of what is above that row. (I plan to split the rows above it into a new dataframe to run some analysis on their content.)

Example split:

#                          0         1       2                  3
0                   Datetime  timezone  source  unique identifier
1  Non Header Row Count = 4        NaN     NaN                NaN

#     0     1     2     3
0  COL1  COL2  COL3  COL4
1    A1    A2    A3    A4
2    B1    B2    B3    B4
3    C1    C2    C3    C4
4    D1    D2    D3    D4

Would this be achievable using isin(), or a combination of isin() with a regex or query()? I've searched for similar examples and questions but couldn't get any of them to work cleanly (and I'm still getting to grips with the pandas documentation).

I'd like to avoid skiprows, as I want to keep the data above the COL1, COL2, COL3 row for data sanity checks. A pre-validation step that reads the file to determine the position of the header row, and then reads it again as a dataframe using skiprows, wouldn't be the optimal approach here.

Any help is appreciated. Apologies if the question isn't clear or I'm making dumb assumptions or have a bad approach. Any criticism, feedback or advice is welcome (constructive or otherwise :))

jezrael · Accepted Answer · 2016-05-13 09:42:52Z


You can use:

import pandas as pd
import io

temp=u"""Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4"""
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp), sep="|")
print(df1)
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN
1                       COL1     COL2   COL3              COL4
2                         A1       A2     A3                A4
3                         B1       B2     B3                B4
4                         C1       C2     C3                C4
5                         D1       D2     D3                D4

df2 = df1[2:]
df2.columns = df1.loc[1,:]
df2 = df2.reset_index(drop=True).rename_axis(None, axis=1)
print(df2)
  COL1 COL2 COL3 COL4
0   A1   A2   A3   A4
1   B1   B2   B3   B4
2   C1   C2   C3   C4
3   D1   D2   D3   D4

print(df1[:1])
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN

EDIT:

You can then find the index of the row where the first column contains COL1 by using str.contains with boolean indexing:

col = df1[df1.iloc[:,0].str.contains('COL1')].index.tolist()[0]
print(col)
1

df2 = df1[col+1:]
df2.columns = df1.loc[col,:]
df2 = df2.reset_index(drop=True).rename_axis(None, axis=1)
print(df2)
  COL1 COL2 COL3 COL4
0   A1   A2   A3   A4
1   B1   B2   B3   B4
2   C1   C2   C3   C4
3   D1   D2   D3   D4

print(df1[:col])
                    Datetime timezone source unique identifier
0  Non Header Row Count = 4       NaN    NaN               NaN
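The question also asked whether isin() could do this on a frame read with header=None. The following is only a minimal sketch of that idea, not part of the original answer: it assumes the expected column names are known up front, and split_at_header is a hypothetical helper name. Reading with header=None keeps the first line (Datetime, timezone, ...) as a data row, so it ends up in the preamble frame instead of becoming the column labels.

```python
import io

import pandas as pd

# Sample data copied from the question (type2.csv layout).
raw = u"""Datetime|timezone|source|unique identifier
Non Header Row Count = 4 |||
COL1|COL2|COL3|COL4
A1|A2|A3|A4
B1|B2|B3|B4
C1|C2|C3|C4
D1|D2|D3|D4"""


def split_at_header(df, expected):
    """Hypothetical helper: split a header=None frame at the row holding `expected`."""
    # True for rows in which every cell is one of the expected column names.
    is_header = df.isin(expected).all(axis=1)
    if not is_header.any():
        raise ValueError("no row matches the expected header names")
    # Position of the first matching row (argmax returns the first True).
    pos = int(is_header.to_numpy().argmax())
    preamble = df.iloc[:pos]
    data = df.iloc[pos + 1:].reset_index(drop=True)
    data.columns = df.iloc[pos].tolist()
    return preamble, data


# after testing, replace io.StringIO(raw) with the filename
df = pd.read_csv(io.StringIO(raw), sep="|", header=None)
preamble, data = split_at_header(df, ["COL1", "COL2", "COL3", "COL4"])
print(preamble)
print(data)
```

Requiring every cell in the row to match (all(axis=1)) is a deliberately strict choice; if files may carry extra or missing columns, any(axis=1) would be a looser alternative. Raising an error when no row matches avoids the IndexError that .tolist()[0] would otherwise produce on a file without the expected header.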

edited May 13, 2016 at 9:42

answered May 13, 2016 at 9:27


1 Comment

Base Starr Over a year ago

This worked very well. I didn't think of approaching it like that. Thanks for the feedback and advice. I will try to expand upon this and provide additional feedback :)
