Python Loop through Excel sheets, place into one df

Question

I have an Excel file foo.xlsx with about 40 sheets sh1, sh2, etc. Each sheet has the format:

area      cnt   name\nparty1   name\nparty2
blah      9         5               5
word      3         7               5

In each sheet I want to rename the variables with the format name\nparty to only have the party as a label. Example output:

area      cnt    party1    party2     sheet
bacon     9         5         5        sh1
spam      3         7         5        sh1
eggs      2         18        4        sh2

I am reading in the file with:

book = pd.ExcelFile(path)

And then wondering if I need to do:

for f in filelist:
    df = pd.ExcelFile.parse(book,sheetname=??)
    'more operations here'
    # only change column names 2 and 3
    for i, col in enumerate(df):
        if 2 <= i <= 3:
            new_col_name = col.split("\n")[-1]
            df[new_col_name] =

Or something like that?

You can also try using xlwings: dataworldofredhairedgirl.blogspot.com/2024/07/…
– 123456, commented Jul 11, 2024 at 6:59

asongtoruin · Accepted Answer · 2021-11-05 15:45:28Z

61

The pandas read_excel function lets you read all sheets at once if you set the keyword parameter sheet_name=None (in some older versions of pandas this was called sheetname). This returns a dictionary: the keys are the sheet names, and the values are the sheets as DataFrames.

Using this, we can simply loop through the dictionary and:

Add an extra column to each dataframe containing the relevant sheet name.
Use the rename method to rename our columns. With a lambda, we take the final entry of the list obtained by splitting each column name at every newline. If there is no newline, the column name is unchanged.
Append to a list, to be combined at the end.

Once this is done, we combine all the sheets into one with pd.concat. Then we reset the index and all should be well. Note: if you have parties present on one sheet but not others, this will still work but will fill any missing columns for each sheet with NaN.

import pandas as pd

sheets_dict = pd.read_excel('Book1.xlsx', sheet_name=None)

all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)

print(full_table)

Prints:

    area  cnt  party1  party2   sheet
0  bacon    9       5       5  Sheet1
1   spam    3       7       5  Sheet1
2   eggs    2      18       4  Sheet2
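
The note about missing parties can be checked without an Excel file. The following is a minimal sketch, assuming an in-memory dictionary stands in for the result of pd.read_excel(..., sheet_name=None); the candidate names (Alice, Bob, Carol, Dave), the party3 column, and the helper name combine_sheets are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel('Book1.xlsx', sheet_name=None):
# a dict mapping sheet name -> DataFrame, built in memory for illustration.
sheets_dict = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9, 3],
                         "Alice\nparty1": [5, 7], "Bob\nparty2": [5, 5]}),
    "sh2": pd.DataFrame({"area": ["eggs"], "cnt": [2],
                         "Carol\nparty1": [18], "Dave\nparty3": [4]}),
}


def combine_sheets(sheets):
    # Keep only the text after the last newline in each header,
    # tag every row with its sheet name, then stack all sheets.
    frames = []
    for name, sheet in sheets.items():
        renamed = sheet.rename(columns=lambda col: col.split("\n")[-1])
        frames.append(renamed.assign(sheet=name))
    return pd.concat(frames, ignore_index=True)


full_table = combine_sheets(sheets_dict)
print(full_table)
```

Here party2 only exists on sh1 and party3 only on sh2, so the combined table has both columns and the cells with no data are NaN. Passing ignore_index=True to pd.concat replaces the separate reset_index step.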

edited Nov 5, 2021 at 15:45

answered Jun 14, 2017 at 15:46

asongtoruin


4 Comments

Yolo_chicken Over a year ago

I'm sorry I was unclear. The name in name\nparty changes each sheet. It's electoral results and I don't want the candidates name just their party. Is there some kind of wild card or string split to only keep everything after the \n?

DalekSec Over a year ago

I think .split('\n')[-1] only keeps parts of a string after the \n. For example, "Frank Underwood\nFictional Democrat".split('\n')[-1] returns 'Fictional Democrat'

asongtoruin Over a year ago

@DalekSec was just editing this in! The correct approach for sure.

Kim Stacks Over a year ago

for version 0.25.1 it should be sheet_name not sheetname. I didn't check if sheetname works in previous versions.

Haribk (edited by marc_s) · Accepted Answer · 2022-01-30 12:15:35Z

3

Consider the following code, which also uses the pandas library.

It takes in only a single sheet and uses df's iterrows():

import pandas as pd

def read_rows():
    filename = "file.xlsx"
    sheet_name = "Sheet Name"
    df = pd.read_excel(filename, sheet_name=sheet_name)
    # Replace NaN with None (cast to object first so None is not coerced back to NaN)
    df = df.astype(object).where(pd.notnull(df), None)
    data = []
    for index, row in df.iterrows():
        # Access a value as row[COLUMN_NAME]; here each row is stored as a dict
        data.append(row.to_dict())
    return data

It's not entirely related to the question asked. I'm just posting it for anybody who needs it.

edited Jan 30, 2022 at 12:15

marc_s


answered Jan 29, 2022 at 17:59

Haribk


cottontail · Accepted Answer · 2023-02-17 21:21:50Z

If the Excel file is really large, it can be better to read the sheets one by one instead of reading the entire file into memory. You can do this using ExcelFile:

with pd.ExcelFile('foo.xlsx') as f:
    sheets = f.sheet_names
    for sht in sheets:
        df = f.parse(sht)
        # do something with df
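
Applied to the question, the "do something with df" step could be the rename and sheet tagging. The following is a minimal sketch; the helper name combine_book is hypothetical, and its argument is assumed to expose sheet_names and parse(name) the way pd.ExcelFile does.

```python
import pandas as pd


def combine_book(book):
    # `book` is assumed to behave like pd.ExcelFile: it has a
    # `sheet_names` list and a `parse(name)` method returning a DataFrame.
    frames = []
    for sheet_name in book.sheet_names:
        df = book.parse(sheet_name)  # only this sheet is parsed here
        # Keep only the text after the last newline in each header
        df = df.rename(columns=lambda col: col.split("\n")[-1])
        df["sheet"] = sheet_name
        frames.append(df)
    # Stack every processed sheet and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)
```

It would be called as `with pd.ExcelFile('foo.xlsx') as f: table = combine_book(f)`. Note that this sketch still keeps every processed sheet in the list, so the saving comes only from work done per sheet, for example filtering rows before appending.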

That said, if the task is to concatenate all sheets into a single frame, there's also a one-liner available:

joined_df = pd.concat(pd.read_excel('foo.xlsx', sheet_name=None).values(), ignore_index=True)

Or, for OP's specific case, pass in names to overwrite the column names of each sheet (instead of operating on each sheet) and concatenate them all:

joined_df = (
    pd.concat(pd.read_excel('foo.xlsx', names=['area','cnt','party1','party2'], sheet_name=None))
    .rename_axis(['Sheet', None]).reset_index(level=0)
    .reset_index(drop=True)
)
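
To see why the chain above ends up with a Sheet column, here is a minimal sketch using an in-memory dictionary as a hypothetical stand-in for the read_excel result, with the column names already overwritten. Passing names this way assumes every sheet has the same four columns in the same order.

```python
import pandas as pd

# Hypothetical stand-in for
# pd.read_excel('foo.xlsx', names=[...], sheet_name=None)
sheets = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9, 3],
                         "party1": [5, 7], "party2": [5, 5]}),
    "sh2": pd.DataFrame({"area": ["eggs"], "cnt": [2],
                         "party1": [18], "party2": [4]}),
}

# Concatenating a dict uses its keys as the outer level of a MultiIndex:
# (sheet name, original row label)
stacked = pd.concat(sheets)

joined_df = (
    stacked.rename_axis(["Sheet", None])  # name the outer index level
    .reset_index(level=0)                 # move it into a 'Sheet' column
    .reset_index(drop=True)               # discard the per-sheet row labels
)
print(joined_df)
```

The sheet name comes from the dictionary key, so no per-sheet loop is needed to record it.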
