
I have a dataframe df made up of n group columns and one "data" column. This dataframe is then grouped on the n group columns.

import pandas as pd

df = pd.DataFrame(data={"g0": ["foo", "foo", "bar", "bar"],
                        "g1": ["baz", "baz", "baz", "qux"], 
                        ...,
                        "gn": [...],
                        "data": [0.1, 0.3, 0.4, 0.2]}, 
                        index=["a", "b", "c", "d"])
groups = df.groupby(by=["g0", "g1", ..., "gn"], sort=False)

I also have a list idx_kept which contains only some of the original dataframe's index labels, e.g. idx_kept = ["a", "b", "d"]. Is there a way to filter groups and keep only the rows whose index label is in idx_kept? My understanding is that DataFrameGroupBy.filter is not appropriate here, because it evaluates a function on each group and then keeps or drops whole groups.

I could filter df directly to get a df_filtered and then do groups_filtered = df_filtered.groupby(by=["g0", "g1", ..., "gn"], sort=False). However, in my process I need both groups and groups_filtered, so my goal is to avoid a second groupby to save some time. Is there an elegant/fast way to achieve that?

Edit: I realise that I should have given some more information, as I received good answers which did not work for my case. My end goal is to compare len(groups) and len(groups_filtered). In the example, using g0 and g1 as group columns and idx_kept = ["a", "b", "d"], len(groups) = 3 but len(groups_filtered) = 2, because "c" was the only member of its group. However, if idx_kept = ["a", "c", "d"], len(groups_filtered) = 3, because "b" was part of a group containing both "a" and "b". So there may be a better approach than the one I had in mind.
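
For reference, the two counts quoted in the edit can be reproduced with the two-groupby baseline described above. This is only a sketch on the g0/g1 version of the example; the helper name count_groups_two_pass is made up for illustration and is not part of pandas.

```python
import pandas as pd

# The four-row example from the question, restricted to g0 and g1.
df = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
keys = ["g0", "g1"]


def count_groups_two_pass(frame, idx_kept):
    """Baseline: group the full frame, then group the filtered frame again."""
    groups = frame.groupby(by=keys, sort=False)
    groups_filtered = frame[frame.index.isin(idx_kept)].groupby(by=keys, sort=False)
    return len(groups), len(groups_filtered)


print(count_groups_two_pass(df, ["a", "b", "d"]))  # (3, 2)
print(count_groups_two_pass(df, ["a", "c", "d"]))  # (3, 3)
```

This is the version that calls groupby twice, i.e. the cost the question is trying to avoid.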

2 Answers


A possible solution is to first filter the dataframe by idx_kept and then do the grouping:

groups_filtered = df[df.index.isin(idx_kept)].groupby(by=["g0", "g1"], sort=False)

In case you need to keep the unfiltered groups:

groups = df.groupby(by=["g0", "g1"], sort=False)
[x[x.index.isin(idx_kept)] for _, x in groups]
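
A runnable sketch of the same idea on the g0/g1 example, assuming idx_kept = ["a", "b", "d"]. The dict filtered_by_key is just an illustrative variation on the list comprehension, so that each filtered piece stays attached to its group key.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
idx_kept = ["a", "b", "d"]

# Group once, on the unfiltered frame.
groups = df.groupby(by=["g0", "g1"], sort=False)

# For each group, keep only the rows whose index label is in idx_kept.
filtered_by_key = {key: x[x.index.isin(idx_kept)] for key, x in groups}

for key, part in filtered_by_key.items():
    print(key, list(part.index))
```

Note that a group whose rows are all dropped (here the one containing only "c") still appears, with an empty dataframe, so the result has as many entries as there are unfiltered groups.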

3 Comments

This works. However, I do need the unfiltered dataframe. So this still means calling DataFrame.groupby() twice, which is what is currently being done. I would like to avoid that if possible.
If I understand what you need correctly, you might try: groups = df.groupby(by=["g0", "g1"], sort=False) followed by [x[x.index.isin(idx_kept)] for _, x in groups]. See my updated solution above.
This worked with a slight twist, as I need the number of groups after filtering (I edited my post, which was lacking some information): [x[x.index.isin(idx_kept)] for _, x in groups if not x[x.index.isin(idx_kept)].empty].
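
If only the number of surviving groups is needed, the comprehension in the last comment can be reduced to a count. The sketch below shows two variants on the g0/g1 example: one that tests each group once, and one based on GroupBy.ngroup(), which is not mentioned in this thread and is offered here only as a possible alternative. Both function names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
groups = df.groupby(by=["g0", "g1"], sort=False)


def count_surviving_groups(groups, idx_kept):
    """Count groups that keep at least one row, testing each group once."""
    return sum(bool(x.index.isin(idx_kept).any()) for _, x in groups)


def count_surviving_groups_ngroup(groups, frame, idx_kept):
    """Same count from per-row group numbers, without a Python-level loop."""
    labels = groups.ngroup()  # one integer group number per row of the frame
    return int(labels[frame.index.isin(idx_kept)].nunique())


print(len(groups))
print(count_surviving_groups(groups, ["a", "b", "d"]))
print(count_surviving_groups_ngroup(groups, df, ["a", "b", "d"]))
```

Both variants reuse the existing groups object, so no second groupby is needed. Which one is faster would have to be measured on the real data.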

Once the groups are formed, it is not directly possible to filter prior to an aggregation with groupby.agg. However, you could do this with groupby.apply.

For instance, if your aggregation is groupby.sum/groupby.agg('sum'), you could use:

# without filtering
groups[['data']].apply(lambda x: x.sum())

# with filtering
groups[['data']].apply(lambda x: x[x.index.isin(idx_kept)].sum())

Example output:

# without filtering
             data
g0  g1  gn       
foo baz xxx   0.4
bar baz xxx   0.4
    qux xxx   0.2

# with filtering
             data
g0  g1  gn       
foo baz xxx   0.4
bar baz xxx   0.0
    qux xxx   0.2
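
A self-contained sketch of the two calls above, assuming the four-row example with only g0 and g1 as group columns (so the gn level shown in the output is absent) and idx_kept = ["a", "b", "d"]:

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
idx_kept = ["a", "b", "d"]

# Group once and reuse the same object for both aggregations.
groups = df.groupby(by=["g0", "g1"], sort=False)

# Sum of every row in each group.
unfiltered = groups[["data"]].apply(lambda x: x.sum())

# Sum of only the rows whose index label is in idx_kept.
filtered = groups[["data"]].apply(lambda x: x[x.index.isin(idx_kept)].sum())

print(unfiltered)
print(filtered)
```

The group that loses all of its rows is still present in the filtered result, with a sum of 0.0, because summing an empty selection returns 0.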

However, groupby.apply is often considerably less efficient than groupby.agg (or the native aggregation functions). Efficiency should be tested on the real data, but I would not be surprised if performing two groupby.agg calls turned out to be faster.

Another option, assuming your aggregation functions are not sensitive to NaN, could be to join a filtered version of your data before aggregation to be able to use efficient aggregation simultaneously on the raw and filtered data:

data_cols = ['data']
groups = (df.join(df.loc[df.index.isin(idx_kept), data_cols]
                    .add_suffix('_filtered'))
            .groupby(by=['g0', 'g1', 'gn'], sort=False)
         )
out = groups.sum()

Output:

             data  data_filtered
g0  g1  gn                      
foo baz xxx   0.4            0.4
bar baz xxx   0.4            0.0
    qux xxx   0.2            0.2

Intermediate output of join:

    g0   g1   gn  data  data_filtered
a  foo  baz  xxx   0.1            0.1
b  foo  baz  xxx   0.3            0.3
c  bar  baz  xxx   0.4            NaN
d  bar  qux  xxx   0.2            0.2
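
A runnable sketch of the join approach on the g0/g1 version of the example, with idx_kept = ["a", "b", "d"]. The last two lines are an addition to the answer above: they use count() on the joined column to obtain the number of groups that keep at least one row, which assumes the original data column itself contains no NaN.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "g0": ["foo", "foo", "bar", "bar"],
        "g1": ["baz", "baz", "baz", "qux"],
        "data": [0.1, 0.3, 0.4, 0.2],
    },
    index=["a", "b", "c", "d"],
)
idx_kept = ["a", "b", "d"]
data_cols = ["data"]

# Copy of the data columns restricted to the kept rows, with a suffix.
kept = df.loc[df.index.isin(idx_kept), data_cols].add_suffix("_filtered")

# Index-aligned join: rows that were filtered out get NaN in data_filtered.
joined = df.join(kept)

# A single groupby serves both the raw and the filtered columns.
groups = joined.groupby(by=["g0", "g1"], sort=False)
out = groups.sum()

# count() ignores NaN, so a group with a zero count lost all of its rows.
n_groups = len(out)
n_groups_filtered = int((groups["data_filtered"].count() > 0).sum())

print(out)
print(n_groups, n_groups_filtered)
```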
