Can pandas implicitly determine header based on value, not row?

Question

I work with people who use Excel and continuously add or subtract rows unbeknownst to me. I have to scrape a document for data, and the row where the header is found changes based on moods.

My challenge is to handle these oscillating currents by detecting where the header is.

I first organized my scrape using xlrd and a number of conditional statements using the values in the workbook.

My initial attempt works and is long (so I will not publish it) but involves bringing in the entire sheet, and not slices:

from xlrd import open_workbook

book = open_workbook(fName)
sheet = book.sheet_by_name(sht)

return book,sheet

However, it is big and I would prefer to get a more targeted selection. The header values never change, nor does when the data shows up after this row.

Do you know of a way to implicitly get the header based on a found value in the sheet using either pandas.ExcelFile or pandas.read_excel?

Here is my attempt with pandas.ExcelFile:

import pandas as pd

xlsx = pd.ExcelFile(fName)
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge, 
                          header=header)

return dataFrame

I cannot get the code to work unless I give the call the correct header value, which is exactly what I'm hoping to avoid.

This previous question seems to present a similar problem without addressing the concern of finding the headers implicitly.

why don't you loop and parse the first rows of the sheet until you find the line where the header is? It will give you the number of rows to pass to skip_rows and you will have pandas parsing you table as usual. — Zeugma
– Zeugma, Commented Sep 12, 2016 at 17:06
Loop and parse using what? That's what my xlrd code does, but I'm unclear how this woud look — double0darbo
– double0darbo, Commented Sep 12, 2016 at 17:08

Zeugma · Accepted Answer · 2016-09-12 17:15:38Z

1

Do the same loop through ExcelFile objects:

xlsx = pd.ExcelFile(fName)
sheet = xlsx.sheet_by_name(sht)
# apply the same algorithm you wrote against xlrd here
# ... results in having header_row = something, 0 based
dataFrame = pd.read_excel(xlsx, sht,
                      parse_cols=21, merge_cells=noMerge, 
                      skip_rows=header_row)

answered Sep 12, 2016 at 17:15

Zeugma

32.3k9 gold badges73 silver badges85 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Can pandas implicitly determine header based on value, not row?

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related