Pandas read excel sheet with multiple header when first column is empty

Question

I have an Excel sheet like this:

I want to read it with pandas read_excel and I tried this:

df = pd.read_excel("test.xlsx", header=[0,1])

but it raises this error:

ParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Any suggestions?

Are you using merged cells for Header 1 and Header 2? If yes, try to go without them.
– Vityata, Commented May 22, 2018 at 16:51
I gotta say it was kind of a let-down after that title when I realized that this question had nothing to do with large black and white bears.
– Ethan The Brave, Commented May 22, 2018 at 17:14

Orenshi · Accepted Answer · 2018-05-23 16:57:27Z

6

If you don't mind massaging the DataFrame after reading the Excel file, you can try the two approaches below:

>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
  header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0    col1       col2       col3    col4       col5
1       a          0          x       3          d
2       b          1          y       4          e
3       c          2          z       5          f

In the output above, you'd have to fix the first level of the MultiIndex yourself, since header1 and header2 are merged cells and only their first column carries the label.
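One possible way to do that fix-up is sketched below. It is not part of the original answer: the frame is built in memory to mimic the output above instead of being read from a workbook, `rebuild_header` is a hypothetical helper name, and it assumes the blank header cells show up as labels starting with "Unnamed", as they do in the output above.

```python
import pandas as pd

# Stand-in for the frame printed above (built in memory, not read from Excel).
raw = pd.DataFrame(
    [
        ["col1", "col2", "col3", "col4", "col5"],
        ["a", 0, "x", 3, "d"],
        ["b", 1, "y", 4, "e"],
        ["c", 2, "z", 5, "f"],
    ],
    columns=["header1", "Unnamed: 1", "Unnamed: 2", "header2", "Unnamed: 4"],
)


def rebuild_header(frame):
    """Hypothetical helper: forward-fill merged labels and promote row 0."""
    top = pd.Series(frame.columns.astype(str))
    # Blank out the placeholder labels, then carry the last real label forward.
    top = top.mask(top.str.startswith("Unnamed")).ffill()
    # The first data row holds the second header level.
    bottom = frame.iloc[0].astype(str)
    out = frame.iloc[1:].reset_index(drop=True)
    out.columns = pd.MultiIndex.from_arrays([top.to_numpy(), bottom.to_numpy()])
    return out


fixed = rebuild_header(raw)
print(fixed)
```

Because the data columns were read together with a header row, they stay as object dtype here; converting them afterwards (for example with `infer_objects()`) may be needed.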

>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols="B:F", skiprows=[0])
        header1      header2
header1    col1 col2    col3 col4
a             0    x       3    d
b             1    y       4    e
c             2    z       5    f

In the output above, it got pretty close by skipping the empty row and parsing only the columns that contain data (B:F). Note, though, that the column labels got shifted.

Note: This is not a clean solution; I just wanted to share samples with you in a post rather than in a comment.

-- Edit based on discussion with OP --

Based on the documentation for pandas read_excel, header=[1,2] creates a MultiIndex for your columns. It looks like the row labels of the DataFrame are then taken from whatever is populated in column A. Since there's nothing there, the index is filled with NaN, like so:

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
    header1           header2
       col1 col2 col3    col4 col5
NaN       a    0    x       3    d
NaN       b    1    y       4    e
NaN       c    2    z       5    f

Again, if you're okay with cleaning up columns and the first column of the xlsx is always blank, you can drop it as shown below. Hopefully this is what you're looking for.

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2]).reset_index().drop(['index'], level=0, axis=1)
  header1           header2
     col1 col2 col3    col4 col5
0       a    0    x       3    d
1       b    1    y       4    e
2       c    2    z       5    f
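The cleanup step can be tried without a workbook. The sketch below builds a stand-in frame in memory with two-level columns and an all-NaN index, which is an assumption about what the read produces based on the output above. It then compares the `reset_index().drop(...)` chain with the shorter `reset_index(drop=True)` mentioned in the comments.

```python
import pandas as pd

# Stand-in for the frame shown above: two-level columns and an all-NaN index.
columns = pd.MultiIndex.from_tuples(
    [
        ("header1", "col1"), ("header1", "col2"), ("header1", "col3"),
        ("header2", "col4"), ("header2", "col5"),
    ]
)
rows = [["a", 0, "x", 3, "d"], ["b", 1, "y", 4, "e"], ["c", 2, "z", 5, "f"]]
df = pd.DataFrame(rows, index=[float("nan")] * 3, columns=columns)

# Approach from the answer: move the index into a column, then drop that column.
via_drop = df.reset_index().drop(["index"], level=0, axis=1)

# Shorter alternative: discard the old index without materialising it.
via_reset = df.reset_index(drop=True)

print(via_reset)
```

Both variants should leave the two header levels untouched and replace the NaN labels with a plain 0..n-1 index.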

edited May 23, 2018 at 16:57

answered May 22, 2018 at 17:04


3 Comments

Alexandra Espichán Over a year ago

Thanks for your suggestion. As you say, it is pretty close, but I need the column names to be in the right place. By experimenting, I found that this works as expected: df = pd.read_excel("/tmp/sample.xlsx", header=[1,2]).reset_index(drop=True). I don't know exactly why it works with that header parameter.

Orenshi Over a year ago

I think this should do the job: pd.read_excel("/tmp/sample.xlsx", header=[1,2]).reset_index().drop(['index'], level=0, axis=1)

Orenshi Over a year ago

I've also modified the original post with my interpretation and understanding of the documentation for read_excel's header parameter. Hopefully others can chime in to clarify our understanding.

BallpointBen · Accepted Answer · 2018-05-22 17:11:29Z

1

Here is the documentation on the header parameter:

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

I think the following should work:

df = pd.read_excel("test.xlsx", skiprows=2, usecols='B:F', header=0)
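The list form of header described in the quoted documentation can be seen in a small, file-free sketch. It uses `pd.read_csv` on an in-memory string as a stand-in, on the assumption that its `header` parameter combines listed rows into a MultiIndex the same way; the sample text is made up to match the question's layout without the blank row and column.

```python
import io

import pandas as pd

# Made-up sample: two header rows followed by three data rows.
text = """header1,header1,header1,header2,header2
col1,col2,col3,col4,col5
a,0,x,3,d
b,1,y,4,e
c,2,z,5,f
"""

# Passing a list of row positions combines those rows into a column MultiIndex.
demo = pd.read_csv(io.StringIO(text), header=[0, 1])

print(demo.columns.tolist())
```

With `header=0` only one of the two rows becomes the column labels, which is why the suggestion above loses Header 1 and Header 2.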

answered May 22, 2018 at 17:11


3 Comments

Orenshi Over a year ago

@OP This is a good solution if you're okay with dropping Header 1 and Header 2.

Alexandra Espichán Over a year ago

Thanks for your suggestion, but I need Header 1 and Header 2. And what if I don't know exactly how many columns there are? It can change, so I can't use usecols='B:F'.

Avik Aggarwal Over a year ago

@AlexandraEspichán Were you able to find a solution to this? I am looking for something similar.
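For the case raised in these comments, where the number of columns is not known in advance, one option is to read the sheet without any header and build the two header levels afterwards. This is a sketch, not something proposed in the thread: the grid below is an in-memory stand-in for what a header-less read might return, and `promote_two_header_rows` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for a header-less read: blank first row and blank first column.
grid = pd.DataFrame(
    [
        [None, None, None, None, None, None],
        [None, "header1", None, None, "header2", None],
        [None, "col1", "col2", "col3", "col4", "col5"],
        [None, "a", 0, "x", 3, "d"],
        [None, "b", 1, "y", 4, "e"],
        [None, "c", 2, "z", 5, "f"],
    ]
)


def promote_two_header_rows(frame):
    """Hypothetical helper: strip empty edges, then use two rows as headers."""
    # Drop rows and columns that are entirely empty, however many there are.
    body = frame.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Merged cells only fill their first column, so carry labels forward.
    top = body.iloc[0].ffill()
    bottom = body.iloc[1]
    out = body.iloc[2:].reset_index(drop=True)
    out.columns = pd.MultiIndex.from_arrays([top.to_numpy(), bottom.to_numpy()])
    return out


result = promote_two_header_rows(grid)
print(result)
```

One caveat of this design: a data column that is empty in every row, including both header rows, would be dropped along with the blank leading column.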
