2

I work with people who use Excel and continuously add or subtract rows unbeknownst to me. I have to scrape a document for data, and the row where the header is found changes based on moods.

My challenge is to handle these oscillating currents by detecting where the header is.

I first organized my scrape using xlrd and a number of conditional statements using the values in the workbook.

My initial attempt works and is long (so I will not publish it) but involves bringing in the entire sheet, and not slices:

from xlrd import open_workbook

book = open_workbook(fName)
sheet = book.sheet_by_name(sht)

return book,sheet

However, it is big and I would prefer to get a more targeted selection. The header values never change, nor does when the data shows up after this row.

Do you know of a way to implicitly get the header based on a found value in the sheet using either pandas.ExcelFile or pandas.read_excel?

Here is my attempt with pandas.ExcelFile:

import pandas as pd

xlsx = pd.ExcelFile(fName)
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge, 
                          header=header)

return dataFrame

I cannot get the code to work unless I give the call the correct header value, which is exactly what I'm hoping to avoid.

This previous question seems to present a similar problem without addressing the concern of finding the headers implicitly.

2
  • why don't you loop and parse the first rows of the sheet until you find the line where the header is? It will give you the number of rows to pass to skip_rows and you will have pandas parsing you table as usual. Commented Sep 12, 2016 at 17:06
  • Loop and parse using what? That's what my xlrd code does, but I'm unclear how this woud look Commented Sep 12, 2016 at 17:08

1 Answer 1

1

Do the same loop through ExcelFile objects:

xlsx = pd.ExcelFile(fName)
sheet = xlsx.sheet_by_name(sht)
# apply the same algorithm you wrote against xlrd here
# ... results in having header_row = something, 0 based
dataFrame = pd.read_excel(xlsx, sht,
                      parse_cols=21, merge_cells=noMerge, 
                      skip_rows=header_row)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.