Hopefully the title is reasonably intuitive; edits are welcome. Say I have this dataframe:
import pandas as pd

df = pd.DataFrame({'x': ['A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'y': [None, None, 1, 2, 3, 4, None, None]})
x y
0 A NaN
1 B NaN
2 B 1.0
3 C 2.0
4 C 3.0
5 C 4.0
6 D NaN
7 D NaN
Per grouping variable, x in this case, I want to keep:
- only the rows where y is not None, if any non-null values exist
- a single row to represent x, in the case that all y values are None
That is: keep A (its only row, which is null), only the non-null row for B, all of C, and one row for D.
Here is one approach:
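pd.concat([
    # groups with at least one non-null y: keep only their non-null rows
    df.groupby('x').filter(lambda g: g['y'].notna().any()).dropna(),
    # groups where y is entirely null: collapse to a single row
    df.groupby('x').filter(lambda g: g['y'].isna().all()).drop_duplicates()
])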
x y
2 B 1.0
3 C 2.0
4 C 3.0
5 C 4.0
0 A NaN
6 D NaN
Alternatively, I could drop the NAs and then merge with the unique values of x to bring back any groups that are no longer represented:
df.loc[df['y'].notna()].merge(df[['x']].drop_duplicates(),
on='x', how='outer')
x y
0 A NaN
1 B 1.0
2 C 2.0
3 C 3.0
4 C 4.0
5 D NaN
Is there something more elegant than this? I thought of some kind of all-in-one filter() but struck out...
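For reference, here is a rough sketch of the kind of single-pass filter I had in mind, written as one boolean mask instead of filter(). The helper names (has_value, group_all_null, first_in_group) are just illustrative, and it assumes that "a single row" for an all-null group can simply be the first row of that group:

```python
import pandas as pd

df = pd.DataFrame({'x': ['A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'y': [None, None, 1, 2, 3, 4, None, None]})

# Rows that carry a real value are always kept.
has_value = df['y'].notna()

# transform('count') broadcasts each group's number of non-null y values
# back onto its rows, so 0 marks groups where y is entirely null.
group_all_null = df.groupby('x')['y'].transform('count').eq(0)

# First row seen for each x (assumption: any one row may represent
# an all-null group, so the first is used).
first_in_group = ~df.duplicated('x')

result = df[has_value | (group_all_null & first_in_group)]
print(result)
```

Unlike the concat version, this should leave the surviving rows in their original order with their original index labels, which may or may not matter for a given use case.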