Selecting multiple columns (`MultiIndex` based) within a `DataFrameGroupBy`

Question

I have a complex dataframe whose columns are all MultiIndex-based. I want to compute some metrics selectively, so I started experimenting with the .groupby method. I can manage the basics: 1) applying the aggregation to the whole dataframe, or 2) applying it to one specific column. However, I would like to apply the aggregation to a subset of columns by naming only some of the labels in the first (outermost) column level. This is easy when the columns have a single level. To illustrate, I created the following MRE, which reproduces my idea and the errors I am getting:

import numpy as np
import pandas as pd


columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Dimensions", "z"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)

df = pd.DataFrame(index=range(11), columns=columns)
df[("Dimensions", "x")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "y")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "z")] = np.random.randint(1, 100, size=11)
df[("Coefficient", "")] = np.random.randint(1, 50, size=11)  # Coefficient as a random integer
df[("Comments", "")] = np.random.choice(["Good", "Average", "Bad"], size=11)
df["Comments"] = df["Comments"].astype("category")

# Basic metrics
print(df.groupby("Comments").mean())  # It works
print(df.groupby("Comments")["Dimensions"].mean())  # It works

# Selecting multiple columns within a MultiIndex based one. Different ideas I tried:
df.groupby("Comments")["Dimensions", "Coefficient"].mean()  # It does not work
df.groupby("Comments")[["Dimensions", "Coefficient"]].mean()  # It does not work
df.groupby("Comments").agg({"Dimensions": "mean", "Coefficient": "mean"})  # It does not work

amance · Accepted Answer · 2024-09-19 15:43:28Z


If you use print(df.columns) you'll see the true column names are tuples rather than single strings.

Try this:

df.groupby("Comments")[[('Dimensions', 'x'), ('Dimensions', 'y'), ('Dimensions', 'z'), ('Coefficient',  '')]].mean()


Category Dimensions                       Coefficient
Details           x          y          z
Comments
Average       35.00  55.166667  59.333333   21.833333
Bad           81.75  24.250000  45.750000   35.750000
Good          36.00   1.000000  42.000000   20.000000
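
To make the "labels are tuples" point concrete, here is a minimal sketch on a small hand-made frame. The values are invented for illustration (they are not the random data from the question), and the frame has one fewer column than the original.

```python
import pandas as pd

# Small, deterministic stand-in for the question's frame (values are made up).
columns = pd.MultiIndex.from_tuples(
    [("Dimensions", "x"), ("Dimensions", "y"), ("Coefficient", ""), ("Comments", "")],
    names=["Category", "Details"],
)
df = pd.DataFrame(
    [[1, 2, 10, "Good"], [7, 8, 40, "Bad"], [5, 6, 30, "Good"]],
    columns=columns,
)

# Every column label is a full tuple with one entry per level.
labels = list(df.columns)
print(labels)

# Selecting on the groupby object with a list of full tuples.
wanted = [("Dimensions", "x"), ("Dimensions", "y"), ("Coefficient", "")]
result = df.groupby("Comments")[wanted].mean()
print(result)
```

Note that the second element of the tuple is an empty string for columns such as Coefficient that have no inner label, so it still has to be spelled out.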


2 Comments

Manu Over a year ago

Thanks! It makes sense! Ideally, though, I would like to select them using only the outermost level (i.e. Dimensions and Coefficient), because I might have many more inner levels and this approach would force me to write out every column. So far I have only managed it with some "dirty" filtering, and I was wondering whether there is a more standard approach:

columns = df.columns[df.columns.get_level_values('Category').isin(['Coefficient', "Dimensions"])]
df.groupby("Comments")[columns].mean()

amance Over a year ago

I would use a list comprehension. Try df.groupby('Comments', observed=True)[[x for x in df.columns if x[0] in ['Dimensions', 'Coefficient']]].mean()
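
The two suggestions in these comments (a boolean mask over the outer level, and a list comprehension over the column tuples) can be compared side by side. This is only a sketch on a small made-up frame; the name `outer` is illustrative and not from the thread.

```python
import pandas as pd

# Small, deterministic stand-in for the question's frame (values are made up).
columns = pd.MultiIndex.from_tuples(
    [("Dimensions", "x"), ("Dimensions", "y"), ("Coefficient", ""), ("Comments", "")],
    names=["Category", "Details"],
)
df = pd.DataFrame(
    [[1, 2, 10, "Good"], [7, 8, 40, "Bad"], [5, 6, 30, "Good"]],
    columns=columns,
)
outer = ["Dimensions", "Coefficient"]  # outer-level labels to keep

# Option 1: boolean mask built from the values of the "Category" level,
# converted to a plain list of tuples.
by_mask = list(df.columns[df.columns.get_level_values("Category").isin(outer)])

# Option 2: list comprehension over the column tuples.
by_comprehension = [col for col in df.columns if col[0] in outer]

# Reuse one groupby object for both selections.
grp = df.groupby("Comments")
means_mask = grp[by_mask].mean()
means_comprehension = grp[by_comprehension].mean()
```

Both options produce the same list of tuples here, so the choice is mostly a matter of taste. The mask version refers to the level by name, while the comprehension relies on the level's position within each tuple.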

sammywemmy · Accepted Answer · 2024-09-29 09:06:32Z

Your suggestion of preselecting the columns is a good standard approach. Another option is the get_columns function from pyjanitor, which allows flexible column selection on a groupby object. I am a contributor to this library.

For your use case, since the columns are a MultiIndex, you can pass a dictionary whose keys are the levels of the index (either the name or the position is accepted) and whose values are the column labels to select:

# pip install pyjanitor
import janitor as jn
import pandas as pd

# reusing a groupby object usually offers
# good performance, depending on the usecase
grp = df.groupby('Comments', observed=True)

In [165]: jn.get_columns(grp, {'Category':['Dimensions','Coefficient']})
Out[165]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x30c0258d0>

In [166]: jn.get_columns(grp, {'Category':['Dimensions','Coefficient']}).mean()
Out[166]:
Category Dimensions               Coefficient
Details           x      y      z
Comments
Average       65.00  62.20  37.00       15.20
Bad           66.25  74.75  38.25       30.75
Good          33.00  52.50  39.00       30.00
# you can also pipe it
(df
.groupby("Comments",observed=True)
.pipe(jn.get_columns, {'Category':['Dimensions','Coefficient']})
.mean()
)
Category Dimensions               Coefficient
Details           x      y      z
Comments
Average       65.00  62.20  37.00       15.20
Bad           66.25  74.75  38.25       30.75
Good          33.00  52.50  39.00       30.00
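
If you would rather stay in plain pandas, the same dictionary idea (level name or position mapped to labels) can be roughly approximated with a few lines. To be clear, `columns_matching` below is a hypothetical helper written for this sketch. It is not part of pandas or pyjanitor, it only does exact label matching, and the data is made up.

```python
import numpy as np
import pandas as pd


def columns_matching(columns, spec):
    """Hypothetical helper: return the column tuples whose levels match every entry in spec."""
    mask = np.ones(len(columns), dtype=bool)
    for level, labels in spec.items():
        # get_level_values accepts either a level name or an integer position.
        mask &= columns.get_level_values(level).isin(labels)
    return list(columns[mask])


# Small, deterministic stand-in for the question's frame (values are made up).
columns = pd.MultiIndex.from_tuples(
    [("Dimensions", "x"), ("Dimensions", "y"), ("Coefficient", ""), ("Comments", "")],
    names=["Category", "Details"],
)
df = pd.DataFrame(
    [[1, 2, 10, "Good"], [7, 8, 40, "Bad"], [5, 6, 30, "Good"]],
    columns=columns,
)

by_name = columns_matching(df.columns, {"Category": ["Dimensions", "Coefficient"]})
by_position = columns_matching(df.columns, {0: ["Dimensions", "Coefficient"]})
only_x = columns_matching(df.columns, {"Category": ["Dimensions"], "Details": ["x"]})

result = df.groupby("Comments")[by_name].mean()
```

Entries in the dictionary are combined with a logical AND, which is a design choice of this sketch; get_columns itself supports more selection styles than this helper does.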
