
Hopefully the title is reasonably intuitive; edits welcome. Say I have this dataframe:

import pandas as pd

df = pd.DataFrame({'x': ['A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'y': [None, None, 1, 2, 3, 4, None, None]})

    x   y
0   A   NaN
1   B   NaN
2   B   1.0
3   C   2.0
4   C   3.0
5   C   4.0
6   D   NaN
7   D   NaN

Per grouping variable, x in this case, I want to keep:

  • only the rows where y is non-null, if the group has any non-null values
  • a single row to represent x when all of the group's y values are null

That is: keep A (its single row is null), only the non-null B row, all of C, and one row for D.

Here is one approach:

pd.concat([
    # groups with at least one non-null y: keep only their non-null rows
    df.groupby('x').filter(lambda g: g['y'].notna().any()).dropna(),
    # all-null groups: keep one representative row each
    df.groupby('x').filter(lambda g: g['y'].isna().all()).drop_duplicates()
])

    x   y
2   B   1.0
3   C   2.0
4   C   3.0
5   C   4.0
0   A   NaN
6   D   NaN
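
(This keeps the original index but not the original row order; a trailing .sort_index() would restore it.)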

I could also drop NAs and merge with the unique values of x to bring back any groups that are no longer represented:

# keep non-null rows, then outer-merge with the unique x values so
# all-null groups come back as a single NaN row each
df.loc[df['y'].notna()].merge(df[['x']].drop_duplicates(),
                              on='x', how='outer')

    x   y
0   A   NaN
1   B   1.0
2   C   2.0
3   C   3.0
4   C   4.0
5   D   NaN
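
(Note the outer merge returns a fresh RangeIndex rather than the original row labels.)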

Is there something more elegant than this? I thought of some kind of all-in-one filter() but struck out...
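
Part of the difficulty is that GroupBy.filter can only keep or drop whole groups, not subset rows within one. For reference, a minimal sketch of the per-group logic using groupby.apply instead (keep_rows is just an illustrative helper name):

def keep_rows(g):
    # keep the non-null rows if any exist, otherwise one representative row
    non_null = g.dropna(subset=['y'])
    return non_null if len(non_null) else g.head(1)

out = df.groupby('x', group_keys=False).apply(keep_rows)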

3 Answers


Keep the non-NaN values with notna. For the groups where all values are NaN, sort the dataframe to place the non-NaN values before the NaN values, then flag duplicates based on the column x. Inverting that mask keeps the first row of each all-NaN group and removes the rest.

# rows where y is non-null
m1 = df['y'].notna()
# sort the non-null values first, then flag every repeat of each x;
# the first row per x (a non-null one where available) is not flagged
m2 = df.sort_values(by='y', na_position='last').duplicated(subset='x')

result = df[m1 | ~m2]
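
Note that m2 is computed on a sorted copy, but both masks align back to df by index label, so df[m1 | ~m2] selects the intended rows regardless of row order.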

End result:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

1 Comment

That's a nice trick. For clarity I'd recommend setting na_position='last' in sort_values (even though it's the default).

Using boolean indexing, you can keep either the first row of each group in which all y values are null (isna + groupby.transform('all') + duplicated), or any row with a non-null value:

m = df['y'].isna()
out = df[
    # keep the first null row of an all-null group, or any non-null row
    (m.groupby(df['x']).transform('all') & ~df.loc[m, 'x'].duplicated()) | ~m
]

Output:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

Intermediates:

   x    y      m duplicated  groupby.transform('all')     ~m  final
0  A  NaN   True      False                      True  False   True
1  B  NaN   True      False                     False  False  False
2  B  1.0  False        NaN                     False   True   True
3  C  2.0  False        NaN                     False   True   True
4  C  3.0  False        NaN                     False   True   True
5  C  4.0  False        NaN                     False   True   True
6  D  NaN   True      False                      True  False   True
7  D  NaN   True       True                      True  False  False

Or, a better approach without groupby: identify the non-null rows, flag the x groups that contain no non-null value, and keep a single deduplicated row from each of those. You can do this with only boolean masks and an OR (|):

# which values are not-null?
m1 = df['y'].notna()
# which groups only contain null values?
m2 = ~df['x'].isin(df.loc[m1, 'x'])
# which (x, y) rows are not duplicates?
m3 = ~df[['x', 'y']].duplicated()
# keep rows that match either condition above
out = df[m1 | (m2 & m3)]

As a one-liner:

out = df[
    (m := df['y'].notna())
    | (~df['x'].isin(df.loc[m, 'x']) & ~df[['x', 'y']].duplicated())
]
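
(The walrus operator := requires Python 3.8+; on older versions, use the two-step form above.)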

Output:

   x    y
0  A  NaN
2  B  1.0
3  C  2.0
4  C  3.0
5  C  4.0
6  D  NaN

Intermediates:

   x    y     m1     m2     m3  m1 | (m2 & m3)
0  A  NaN  False   True   True            True
1  B  NaN  False  False   True           False
2  B  1.0   True  False   True            True
3  C  2.0   True  False   True            True
4  C  3.0   True  False   True            True
5  C  4.0   True  False   True            True
6  D  NaN  False   True   True            True
7  D  NaN  False   True  False           False



A possible solution:

# True for rows whose x group has at least one non-null y
m = df.groupby('x')['y'].transform(lambda x: x.notna().any())

(df[m & df['y'].notna() | (~m)]
 .drop_duplicates(subset=['x', 'y'], keep='first')
 .reset_index(drop=True))

It uses a combination of groupby with transform to create a boolean mask m that identifies groups containing any non-null values. The expression m & df['y'].notna() | (~m) then selects either: (1) non-null rows from groups that have at least one non-null value, or (2) all rows from groups where all values are null. Finally, drop_duplicates with subset=['x', 'y'] ensures that for the null-only groups, only one representative row is kept while preserving all distinct non-null values from mixed groups, and reset_index cleans up the resulting index.
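
Since & binds tighter than | in Python, the unparenthesized mask evaluates as intended; if you prefer, the same expression reads more explicitly with parentheses:

out = (df[(m & df['y'].notna()) | ~m]
       .drop_duplicates(subset=['x', 'y'], keep='first')
       .reset_index(drop=True))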

Output:

   x    y
0  A  NaN
1  B  1.0
2  C  2.0
3  C  3.0
4  C  4.0
5  D  NaN

