Pandas: Reading Excel with merged cells

Question

I have Excel files with multiple sheets, each of which looks a little like this (but much longer):

        Sample  CD4     CD8
Day 1   8311    17.3    6.44
        8312    13.6    3.50
        8321    19.8    5.88
        8322    13.5    4.09
Day 2   8311    16.0    4.92
        8312    5.67    2.28
        8321    13.0    4.34
        8322    10.6    1.95

The first column is actually four cells merged vertically.

When I read this using pandas.read_excel, I get a DataFrame that looks like this:

       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)

unutbu · Accepted Answer · 2014-04-08 13:13:43Z

87

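You could use the Series.fillna method to forward-fill the NaN values (in pandas 2.1 and later, fillna(method='ffill') is deprecated and Series.ffill() does the same thing):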

df.index = pd.Series(df.index).fillna(method='ffill')

For example,

In [42]: df
Out[42]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95

[8 rows x 3 columns]

In [43]: df.index = pd.Series(df.index).fillna(method='ffill')

In [44]: df
Out[44]: 
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
Day 1    8312  13.60  3.50
Day 1    8321  19.80  5.88
Day 1    8322  13.50  4.09
Day 2    8311  16.00  4.92
Day 2    8312   5.67  2.28
Day 2    8321  13.00  4.34
Day 2    8322  10.60  1.95

[8 rows x 3 columns]
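For a copy-and-paste version, here is a minimal sketch that rebuilds the frame from the question by hand and forward-fills its index. It uses Series.ffill(), the newer spelling of fillna(method='ffill'). The groupby at the end is only an illustration of what the repaired index allows, and the name mean_cd4 is made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: the merged "Day" cells arrive as NaN labels.
df = pd.DataFrame(
    {
        "Sample": [8311, 8312, 8321, 8322] * 2,
        "CD4": [17.3, 13.6, 19.8, 13.5, 16.0, 5.67, 13.0, 10.6],
        "CD8": [6.44, 3.50, 5.88, 4.09, 4.92, 2.28, 4.34, 1.95],
    },
    index=["Day 1", np.nan, np.nan, np.nan, "Day 2", np.nan, np.nan, np.nan],
)

# Forward-fill the index labels so every row carries its day.
df.index = pd.Series(df.index).ffill()

# With a complete index, grouping by day works as expected.
mean_cd4 = df.groupby(level=0)["CD4"].mean()
print(mean_cd4)
```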

6 Comments

Samarth Bharadwaj Over a year ago

How would you solve the same problem for merged columns instead of rows?

unutbu Over a year ago

@SamarthBharadwaj: The fillna method has an axis parameter which controls the direction to be filled. To fill all the NaNs in a DataFrame row-wise, you could use df = df.fillna(method='ffill', axis=1). To fill only selected rows, use df.loc or df.iloc. For example, df.loc[mask] = df.loc[mask].fillna(method='ffill', axis=1).
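A small sketch of the row-wise fill described in this comment, assuming a sheet that was read with header=None and whose top row holds labels merged across two columns each. The frame raw and its values are invented for illustration, and DataFrame.ffill(axis=1) stands in for fillna(method='ffill', axis=1).

```python
import numpy as np
import pandas as pd

# Hypothetical raw sheet: each label in row 0 spans two columns, so the
# second cell of each merged pair is read as NaN.
raw = pd.DataFrame(
    [
        ["Day 1", np.nan, "Day 2", np.nan],
        ["CD4", "CD8", "CD4", "CD8"],
        [17.3, 6.44, 16.0, 4.92],
    ]
)

# Select only the header row and fill it from left to right.
mask = raw.index == 0
raw.loc[mask] = raw.loc[mask].ffill(axis=1)

print(raw.iloc[0].tolist())
```

Restricting the fill to the masked rows keeps genuine gaps in the data rows untouched.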

Samarth Bharadwaj Over a year ago

@unutbu thx, but my question is slightly different, expressed here: stackoverflow.com/questions/27420263/…

wander95 Over a year ago

also worked when the problematic column was not the index

PlasmaBinturong Over a year ago

fillna with ffill is fine as long as a merged cell is not followed by an intentionally empty cell...
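To make this caveat concrete, here is a tiny sketch with invented labels. Suppose the last cell was left blank on purpose rather than being part of a merged range. A forward fill cannot tell the difference and fills it anyway.

```python
import numpy as np
import pandas as pd

# Rows 0-1 are a merged "Day 1" cell, rows 2-3 a merged "Day 2" cell.
# Row 4 is assumed to be intentionally empty, not part of any merge.
labels = pd.Series(["Day 1", np.nan, "Day 2", np.nan, np.nan])

filled = labels.ffill()

# The intentionally empty last entry has silently become "Day 2".
print(filled.tolist())
```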

Nathan Pyle · Accepted Answer · 2022-01-24 22:51:58Z

22

To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.

df = pd.read_excel('path_to_file.xlsx', index_col=[0])

Passing index_col as a list will cause pandas to look for a MultiIndex. In the case where there is a list of length one, pandas creates a regular Index filling in the data.
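The difference can be checked with a round trip through an in-memory workbook. This is a sketch only: it assumes openpyxl is installed (pandas uses it to read .xlsx files), and the sheet contents and the names scalar_df and list_df are made up for the demonstration.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Build a small workbook whose first column has vertically merged cells.
wb = Workbook()
ws = wb.active
ws.append([None, "Sample", "CD4", "CD8"])
ws.append(["Day 1", 8311, 17.3, 6.44])
ws.append([None, 8312, 13.6, 3.50])
ws.append(["Day 2", 8311, 16.0, 4.92])
ws.append([None, 8312, 5.67, 2.28])
ws.merge_cells("A2:A3")
ws.merge_cells("A4:A5")

buffer = BytesIO()
wb.save(buffer)

# Scalar index_col: rows under each merged label get a NaN index entry.
buffer.seek(0)
scalar_df = pd.read_excel(buffer, index_col=0)

# List index_col: the index column is forward-filled while parsing.
buffer.seek(0)
list_df = pd.read_excel(buffer, index_col=[0])

print(list(scalar_df.index))
print(list(list_df.index))
```

Note that this fill applies only to the index columns, so merged cells elsewhere in the sheet still come back as NaN.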

2 Comments

Michael Tiemann Over a year ago

Brilliant! In my case, index_col=[0,1,2,3]. But yes!!

Sudarshan Kadam Over a year ago

Thank you!!! No wonder reading other comments than the top answer always pays off :)

Tim · Accepted Answer · 2019-10-19 10:33:08Z

15

df = df.fillna(method='ffill', axis=0)  # forward-fill down each column so the rows under a merged cell inherit its value

answered Oct 19, 2019 at 6:48 by Muth; edited Oct 19, 2019 at 10:33 by Tim

1 Comment

Adrian Mole Over a year ago

Code-only answers are generally frowned upon on Stack Overflow. In order to avoid being closed as 'low quality', please add some explanatory text.

Weston A. Greene · Accepted Answer · 2024-11-11 16:14:39Z

2

To read an Excel file so that merged cells are filled in (in other words, every cell of a merged range gets the same value in the Pandas DataFrame), I used the following code. It was largely inspired by @ztr. Thank you, ztr.

from openpyxl import load_workbook
import pandas as pd

def _convert_cell_ref_to_df_ref(cell_ref: tuple) -> tuple:
    # openpyxl yields 1-based (row, column) pairs; `iloc` expects 0-based ones.
    row_offset = 1
    col_offset = 1
    return (cell_ref[0] - row_offset, cell_ref[1] - col_offset)

file_path = '/file/path.xlsx'  # Will not work for `.xls`
sheet_name = 'sheet name'

excel = pd.ExcelFile(file_path)
df = excel.parse(
    sheet_name=sheet_name,
    header=None,  # If you want to keep the default headers, then remove this argument and change `row_offset` from 1 to 2.
)
openpyxl_wb = load_workbook(file_path)

for merged_cell in openpyxl_wb[sheet_name].merged_cells:
    try:
        merge_val = df.iloc[_convert_cell_ref_to_df_ref(next(iter(merged_cell.cells)))]
        for cell in merged_cell.cells:
            df.iloc[_convert_cell_ref_to_df_ref(cell)] = merge_val
    except IndexError as e:
        print("Most likely the last row in this Excel file is a blank merged cell, which is often trimmed when read by Pandas.")
        print(e)

See my other answer for how to read sheet names from an Excel file.
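An alternative sketch does the fill on the openpyxl side before pandas sees the data: unmerge each range and write its top-left value into every cell. The helper name fill_merged_cells and the in-memory sheet are hypothetical. With a real file, ws would come from load_workbook(file_path)[sheet_name] instead.

```python
import pandas as pd
from openpyxl import Workbook


def fill_merged_cells(ws):
    """Unmerge every merged range and copy its top-left value into each cell."""
    # Copy the ranges first, because unmerge_cells() mutates ws.merged_cells.
    for merged in list(ws.merged_cells.ranges):
        top_left = ws.cell(row=merged.min_row, column=merged.min_col).value
        ws.unmerge_cells(merged.coord)
        for row in ws.iter_rows(
            min_row=merged.min_row, max_row=merged.max_row,
            min_col=merged.min_col, max_col=merged.max_col,
        ):
            for cell in row:
                cell.value = top_left


# Hypothetical in-memory sheet standing in for a workbook loaded from disk.
wb = Workbook()
ws = wb.active
ws.append([None, "Sample", "CD4", "CD8"])
ws.append(["Day 1", 8311, 17.3, 6.44])
ws.append([None, 8312, 13.6, 3.50])
ws.append(["Day 2", 8311, 16.0, 4.92])
ws.append([None, 8312, 5.67, 2.28])
ws.merge_cells("A2:A3")
ws.merge_cells("A4:A5")

fill_merged_cells(ws)

# Build the DataFrame straight from the repaired worksheet values.
rows = list(ws.values)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Because only cells inside merged ranges are touched, cells that were left empty on purpose stay empty.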

answered Aug 13, 2024 at 14:59; edited Nov 11, 2024 at 16:14

3 Comments

Weston A. Greene Over a year ago

I'm so glad @sgao! Happy coding.

Weston A. Greene Dec 24, 2024 at 0:23

This thread may successfully accomplish what I did here but with built-in OpenPyxl functions. I haven't confirmed due to lack of time.

Ofer Barasofsky Apr 3 at 7:25

Thank you, I was really struggling with it. How strange that pandas doesn't distinguish between empty cells that were left empty on purpose and merged cells.

ztr · Accepted Answer · 2024-02-23 08:26:06Z

You can use openpyxl. Note: the Excel sheet contains a header row and an index column while the DataFrame body doesn't, so each index has to be reduced by 1. openpyxl also uses 1-based indices while iloc uses 0-based ones, so the total offset is -2. This snippet may not be efficient, because I only handle sheets of about 20x20. You can improve it on your own.

# %%
from openpyxl import load_workbook
import pandas as pd
file_name = "file.xlsx"
df = pd.read_excel(file_name, index_col=0, header=0)

wb = load_workbook(file_name)
# First sheet, which is the one read_excel loads by default (get_sheet_by_name is deprecated).
sheet = wb[wb.sheetnames[0]]
ms_set = sheet.merged_cells
# %%
for ms in ms_set:
    # 1-based
    # (start col, start row, end col [included], end row [included])
    b = ms.bounds

    # This method is not efficient, especially since your file is large, but you may find a parallelized approach or some Python syntactic sugar to speed it up.
    df.iloc[b[1]-2:b[3]-1, b[0]-2:b[2]-1] = df.iloc[b[1]-2, b[0]-2]
# %%
df
# %%
