
Hopefully the title is reasonably intuitive; edits welcome. Say I have this dataframe:

import pandas as pd

df = pd.DataFrame({'x': ['A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'y': [None, None, 1, 2, 3, 4, None, None]})

    x   y
0   A   NaN
1   B   NaN
2   B   1.0
3   C   2.0
4   C   3.0
5   C   4.0
6   D   NaN
7   D   NaN

Per grouping variable, x in this case, I want to keep:

  • only the rows where y is non-null, if the group has any non-null values
  • a single row to represent x when all of the group's y values are null

That is: keep A (its single row is null), only the non-null B row, all of C, and one row for D.

Here is one approach:

pd.concat([
    # groups with at least one non-null y: keep only their non-null rows
    df.groupby('x').filter(lambda g: g['y'].notna().any()).dropna(),
    # all-null groups: keep one representative row each
    df.groupby('x').filter(lambda g: g['y'].isna().all()).drop_duplicates()
])

    x   y
2   B   1.0
3   C   2.0
4   C   3.0
5   C   4.0
0   A   NaN
6   D   NaN
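
(This keeps the original index but not the original row order; a trailing .sort_index() would restore it.)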

I could also drop NAs and merge with the unique values of x to bring back any groups that are no longer represented:

# keep non-null rows, then outer-merge with the unique x values so
# all-null groups come back as a single NaN row each
df.loc[df['y'].notna()].merge(df[['x']].drop_duplicates(),
                              on='x', how='outer')

    x   y
0   A   NaN
1   B   1.0
2   C   2.0
3   C   3.0
4   C   4.0
5   D   NaN
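
(Note the outer merge returns a fresh RangeIndex rather than the original row labels.)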

Is there something more elegant than this? I thought of some kind of all-in-one filter() but struck out...
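
Part of the difficulty is that GroupBy.filter can only keep or drop whole groups, not subset rows within one. For reference, a minimal sketch of the per-group logic using groupby.apply instead (keep_rows is just an illustrative helper name):

def keep_rows(g):
    # keep the non-null rows if any exist, otherwise one representative row
    non_null = g.dropna(subset=['y'])
    return non_null if len(non_null) else g.head(1)

out = df.groupby('x', group_keys=False).apply(keep_rows)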

3 Answers


Keep the non-NaN values with notna. For the groups where all values are NaN, sort the dataframe to place the non-NaN values before the NaN values, then flag duplicates based on the column x. Inverting that mask keeps the first row of each all-NaN group and removes the rest.

# rows where y is non-null
m1 = df['y'].notna()
# sort the non-null values first, then flag every repeat of each x;
# the first row per x (a non-null one where available) is not flagged
m2 = df.sort_values(by='y', na_position='last').duplicated(subset='x')

result = df[m1 | ~m2]
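
Note that m2 is computed on a sorted copy, but both masks align back to df by index label, so df[m1 | ~m2] selects the intended rows regardless of row order.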

End result:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

1 Comment

That's a nice trick. For clarity I'd recommend setting na_position='last' in sort_values (even though it's the default).

Using boolean indexing, you can keep either the first row of each group in which all y values are null (isna + groupby.transform('all') + duplicated), or any row with a non-null value:

m = df['y'].isna()
out = df[
    # keep the first null row of an all-null group, or any non-null row
    (m.groupby(df['x']).transform('all') & ~df.loc[m, 'x'].duplicated()) | ~m
]

Output:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

Intermediates:

   x    y      m duplicated  groupby.transform('all')     ~m  final
0  A  NaN   True      False                      True  False   True
1  B  NaN   True      False                     False  False  False
2  B  1.0  False        NaN                     False   True   True
3  C  2.0  False        NaN                     False   True   True
4  C  3.0  False        NaN                     False   True   True
5  C  4.0  False        NaN                     False   True   True
6  D  NaN   True      False                      True  False   True
7  D  NaN   True       True                      True  False  False

Or, a better approach without groupby: identify the non-null rows, flag the x groups that contain no non-null value, and keep a single deduplicated row from each of those. You can do this with only boolean masks and an OR (|):

# which values are not-null?
m1 = df['y'].notna()
# which groups only contain null values?
m2 = ~df['x'].isin(df.loc[m1, 'x'])
# which (x, y) rows are not duplicates?
m3 = ~df[['x', 'y']].duplicated()
# keep rows that match either condition above
out = df[m1 | (m2 & m3)]

As a one-liner:

out = df[
    (m := df['y'].notna())
    | (~df['x'].isin(df.loc[m, 'x']) & ~df[['x', 'y']].duplicated())
]
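
(The walrus operator := requires Python 3.8+; on older versions, use the two-step form above.)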

Output:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

Intermediates:

   x    y     m1     m2     m3  m1 | (m2 & m3)
0  A  NaN  False   True   True            True
1  B  NaN  False  False   True           False
2  B  1.0   True  False   True            True
3  C  2.0   True  False   True            True
4  C  3.0   True  False   True            True
5  C  4.0   True  False   True            True
6  D  NaN  False   True   True            True
7  D  NaN  False   True  False           False



A possible solution:

# True for rows whose x group has at least one non-null y
m = df.groupby('x')['y'].transform(lambda x: x.notna().any())

(df[m & df['y'].notna() | (~m)]
 .drop_duplicates(subset=['x', 'y'], keep='first')
 .reset_index(drop=True))

It uses a combination of groupby with transform to create a boolean mask m that identifies groups containing any non-null values. The expression m & df['y'].notna() | (~m) then selects either: (1) non-null rows from groups that have at least one non-null value, or (2) all rows from groups where all values are null. Finally, drop_duplicates with subset=['x', 'y'] ensures that for the null-only groups, only one representative row is kept while preserving all distinct non-null values from mixed groups, and reset_index cleans up the resulting index.
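
Since & binds tighter than | in Python, the unparenthesized mask evaluates as intended; if you prefer, the same expression reads more explicitly with parentheses:

out = (df[(m & df['y'].notna()) | ~m]
       .drop_duplicates(subset=['x', 'y'], keep='first')
       .reset_index(drop=True))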

Output:

   x    y
0  A  NaN
1  B  1.0
2  C  2.0
3  C  3.0
4  C  4.0
5  D  NaN

