I have an Excel file foo.xlsx with about 40 sheets (sh1, sh2, etc.). Each sheet has the format:

area      cnt   name\nparty1   name\nparty2
blah      9         5               5
word      3         7               5

In each sheet I want to rename the columns of the form name\nparty so that only the party remains as the label. Example output:

area      cnt    party1    party2     sheet
bacon     9         5         5        sh1
spam      3         7         5        sh1
eggs      2         18        4        sh2

I am reading in the file with:

book = pd.ExcelFile(path) 

And then wondering if I need to do:

for f in filelist:
    df = pd.ExcelFile.parse(book, sheetname=??)
    'more operations here'
    # only change column names 2 and 3
    for i, col in enumerate(df):
        if i >= 2 and i <= 3:
            new_col_name = col.split("\n")[-1]
            df[new_col_name] =

Or something like that?

3 Answers

The read_excel method of pandas lets you read all sheets in at once if you set the keyword parameter sheet_name=None (in some older versions of pandas this was called sheetname). This returns a dictionary - the keys are the sheet names, and the values are the sheets as dataframes.

Using this, we can simply loop through the dictionary and:

  1. Add an extra column to each dataframe containing the relevant sheet name
  2. Use the rename method to rename the columns: the lambda splits each column name on newlines and keeps the final piece. If a name contains no newline, it is left unchanged.
  3. Append to a list, to be combined at the end.

Once this is done, we combine all the sheets into one with pd.concat. Then we reset the index and all should be well. Note: if you have parties present on one sheet but not others, this will still work but will fill any missing columns for each sheet with NaN.

import pandas as pd

sheets_dict = pd.read_excel('Book1.xlsx', sheet_name=None)

all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)

print(full_table)

Prints:

    area  cnt  party1  party2   sheet
0  bacon    9       5       5  Sheet1
1   spam    3       7       5  Sheet1
2   eggs    2      18       4  Sheet2
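
To illustrate the note above about parties that appear on only some sheets, here is a minimal sketch that skips the Excel file entirely. It assumes a hand-built dictionary standing in for what read_excel returns with sheet_name=None; the candidate names and the party3 column are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): sheet name -> DataFrame.
# Candidate names and "party3" are hypothetical, not from the original data.
sheets_dict = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9, 3],
                         "Alice\nparty1": [5, 7], "Bob\nparty2": [5, 5]}),
    "sh2": pd.DataFrame({"area": ["eggs"], "cnt": [2],
                         "Carol\nparty1": [18], "Dave\nparty3": [4]}),
}

all_sheets = []
for name, sheet in sheets_dict.items():
    # Keep only the text after the last newline in each column name.
    renamed = sheet.rename(columns=lambda c: c.split("\n")[-1])
    # assign() returns a new frame, so the dictionary's frames are not modified.
    all_sheets.append(renamed.assign(sheet=name))

# Columns missing from a sheet are filled with NaN for that sheet's rows.
full_table = pd.concat(all_sheets, ignore_index=True)
print(full_table)
```

Here party2 should come out as NaN for the sh2 row and party3 as NaN for the sh1 rows, because pd.concat takes the union of the columns by default.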

4 Comments

I'm sorry I was unclear. The name in name\nparty changes on each sheet. It's electoral results and I don't want the candidate's name, just their party. Is there some kind of wildcard or string split to keep only what comes after the \n?
I think .split('\n')[-1] only keeps parts of a string after the \n. For example, "Frank Underwood\nFictional Democrat".split('\n')[-1] returns 'Fictional Democrat'
@DalekSec was just editing this in! The correct approach for sure.
For version 0.25.1 it should be sheet_name, not sheetname. I didn't check whether sheetname works in previous versions.

Consider the following code, which also uses the pandas library.

It takes in only a single sheet and uses df's iterrows():

import pandas as pd

def read_csv():
    filename = "file.xlsx"
    sheet_name = "Sheet Name"
    df = pd.read_excel(filename, sheet_name=sheet_name)
    # Replace NaN with None (cast to object first so numeric columns can hold None)
    df = df.astype(object).where(pd.notnull(df), None)
    data = []
    for index, row in df.iterrows():
        # Read a value with row[COLUMN_NAME]; to keep a single column, use
        # data.append({'column': row['column']}) instead
        data.append(row.to_dict())
    return data

It's not entirely related to the question asked; I'm just posting it for anybody who needs it.

If the Excel file is really large, it can be better to read the sheets in one by one instead of reading the entire file into memory. You can do this using ExcelFile:

with pd.ExcelFile('foo.xlsx') as f:
    sheets = f.sheet_names
    for sht in sheets:
        df = f.parse(sht)
        # do something with df

That said, if the task is to concatenate all sheets into a single frame, there's also a one-liner available:

joined_df = pd.concat(pd.read_excel('foo.xlsx', sheet_name=None).values(), ignore_index=True)

Or, for OP's specific case, pass names to overwrite the column names of every sheet while reading (instead of renaming sheet by sheet), then concatenate them all. This assumes every sheet has the same four columns in the same order:

joined_df = (
    pd.concat(pd.read_excel('foo.xlsx', names=['area','cnt','party1','party2'], sheet_name=None))
    .rename_axis(['Sheet', None]).reset_index(level=0)
    .reset_index(drop=True)
)
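
The rename_axis / reset_index chain works because pd.concat, when given a dictionary, uses the keys as an outer index level. Here is a minimal sketch of that mechanism, assuming a small hand-built dictionary (the frames variable is hypothetical) in place of the read_excel call:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel('foo.xlsx', sheet_name=None, names=[...]).
frames = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9, 3]}),
    "sh2": pd.DataFrame({"area": ["eggs"], "cnt": [2]}),
}

# Concatenating a dict yields a two-level index: (dict key, original row label).
stacked = pd.concat(frames)

joined_df = (
    stacked.rename_axis(["Sheet", None])  # name the outer level "Sheet"
    .reset_index(level=0)                 # move that level into a column
    .reset_index(drop=True)               # discard the per-sheet row labels
)
print(joined_df)
```

The sheet name ends up as the first column, and the final reset_index(drop=True) replaces the repeated per-sheet row labels with a fresh 0..n-1 index.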
