I have a complex DataFrame whose columns are all part of a MultiIndex. I wanted to be more specific when estimating some metrics, so I started experimenting with the .groupby method. I can manage the basics: 1) computing the aggregation on the whole DataFrame, or 2) computing it for one specific column. However, I would like to compute the aggregation for several columns by naming only some of the labels in the first (outermost) column level. This is easy when the columns have a single level. To illustrate, I created the following MRE (minimal reproducible example), which reproduces my idea and the errors I am getting:

import numpy as np
import pandas as pd


columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Dimensions", "z"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)

df = pd.DataFrame(index=range(11), columns=columns)
df[("Dimensions", "x")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "y")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "z")] = np.random.randint(1, 100, size=11)
df[("Coefficient", "")] = np.random.randint(1, 50, size=11)  # Coefficient as a random integer
df[("Comments", "")] = np.random.choice(["Good", "Average", "Bad"], size=11)
df["Comments"] = df["Comments"].astype("category")

# Basic metrics
print(df.groupby("Comments").mean())  # It works
print(df.groupby("Comments")["Dimensions"].mean())  # It works

# Selecting multiple columns within a MultiIndex based one. Different ideas I tried:
df.groupby("Comments")["Dimensions", "Coefficient"].mean()  # It does not work
df.groupby("Comments")[["Dimensions", "Coefficient"]].mean()  # It does not work
df.groupby("Comments").agg({"Dimensions": "mean", "Coefficient": "mean"})  # It does not work

2 Answers

If you use print(df.columns) you'll see the true column names are tuples rather than single strings.

Try this:

df.groupby("Comments")[[('Dimensions', 'x'), ('Dimensions', 'y'), ('Dimensions', 'z'), ('Coefficient',  '')]].mean()


Category  Dimensions                        Coefficient
Details            x          y          z
Comments
Average        35.00  55.166667  59.333333    21.833333
Bad            81.75  24.250000  45.750000    35.750000
Good           36.00   1.000000  42.000000    20.000000
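
To make the point about tuple labels concrete, here is a minimal sketch that rebuilds only the column index from the question (no data needed) and lists its labels. The variable names labels and outer are illustrative and not part of the original post.

```python
import pandas as pd

# Same column layout as in the question.
columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Dimensions", "z"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)

# Each column label is a full (Category, Details) tuple.
labels = columns.tolist()
print(labels)

# The distinct outer-level names, in order of first appearance.
outer = columns.get_level_values("Category").unique().tolist()
print(outer)
```

Because every label is a tuple, a column selection on the groupby object has to be spelled with those tuples, which is what the list above does.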

2 Comments

Thanks, that makes sense! Ideally, though, I would like to select them using only the outermost level (i.e. Dimensions and Coefficient), because I might have many more inner levels and this approach would force me to write out every column. So far I have only managed it with some "dirty" filtering, but I was wondering whether there is a more standard approach: columns = df.columns[df.columns.get_level_values('Category').isin(['Coefficient', "Dimensions"])] followed by df.groupby("Comments")[columns].mean()
I would use list comprehension. Try df.groupby('Comments', observed=True)[[x for x in df.columns if x[0] in ['Dimensions', 'Coefficient']]].mean()
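
The two selection ideas from these comments can be put side by side in a self-contained sketch. It assumes the same column layout as the question; the seeded generator, the dict-based construction, and the names wanted, cols_from_mask and cols_from_comp are my own choices for illustration, not something from the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n = 11

# Tuple keys in the dict become MultiIndex columns.
df = pd.DataFrame(
    {
        ("Dimensions", "x"): rng.integers(1, 100, size=n),
        ("Dimensions", "y"): rng.integers(1, 100, size=n),
        ("Dimensions", "z"): rng.integers(1, 100, size=n),
        ("Coefficient", ""): rng.integers(1, 50, size=n),
        ("Comments", ""): rng.choice(["Good", "Average", "Bad"], size=n),
    }
)
df.columns = df.columns.set_names(["Category", "Details"])

wanted = ["Dimensions", "Coefficient"]

# Option A: boolean mask on the outer level (the "filtering" approach).
mask = df.columns.get_level_values("Category").isin(wanted)
cols_from_mask = df.columns[mask].tolist()

# Option B: list comprehension over the tuple labels.
cols_from_comp = [col for col in df.columns if col[0] in wanted]

# Both give the same list of tuples, which the groupby object accepts.
result = df.groupby("Comments", observed=True)[cols_from_mask].mean()
print(result)
```

Either option scales to any number of inner labels, since only the outer-level names are written by hand.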

Your suggestion of preselecting the columns is a good standard approach. One other option is with the get_columns from pyjanitor, which allows some flexible selection on a groupby object. I am a contributor to this library.

For your use case, since the columns are a MultiIndex, you can pass a dictionary whose keys are the levels of the index (either name or position is accepted) and whose values are the column labels to select at that level:

# pip install pyjanitor
import janitor as jn
import pandas as pd

# reusing a groupby object usually offers
# good performance, depending on the use case
grp = df.groupby('Comments', observed=True)

In [165]: jn.get_columns(grp, {'Category':['Dimensions','Coefficient']})
Out[165]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x30c0258d0>

In [166]: jn.get_columns(grp, {'Category':['Dimensions','Coefficient']}).mean()
Out[166]:
Category Dimensions               Coefficient
Details           x      y      z
Comments
Average       65.00  62.20  37.00       15.20
Bad           66.25  74.75  38.25       30.75
Good          33.00  52.50  39.00       30.00
# you can also pipe it
(df
.groupby("Comments",observed=True)
.pipe(jn.get_columns, {'Category':['Dimensions','Coefficient']})
.mean()
)
Category Dimensions               Coefficient
Details           x      y      z
Comments
Average       65.00  62.20  37.00       15.20
Bad           66.25  74.75  38.25       30.75
Good          33.00  52.50  39.00       30.00
