Given a pandas dataframe containing possible NaN values scattered here and there:
Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?
UPDATE: using Pandas 0.22.0
Newer Pandas versions have the methods DataFrame.isna() and DataFrame.notna(), which are aliases for isnull() and notnull().
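For reference, the example DataFrame used throughout can be rebuilt from the outputs shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 0, 2, 1, 1, 7, 2, 9, 3, 9],
                   'b': [7, np.nan, np.nan, 7, 3, 4, 6, 6, 0, 0],
                   'c': [0, 4, 4, 0, 9, 9, 9, 4, 9, 1]})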
In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
As a list of column names:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
to select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0
OLD answer:
Try using isnull():
In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as @root proposed, the clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
to select a subset - all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0
Comments: to find columns containing a specific value rather than NaN, use df.columns[df.eq(search_for_value).any()].tolist() (df.isin([...]) works similarly). You can also use df.isnull().sum(), which shows all columns and the total number of NaNs in each.
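A minimal sketch of that value-search variant; search_for_value is a placeholder, set here to 4:
search_for_value = 4  # any value you want to locate instead of NaN
df.columns[df.eq(search_for_value).any()].tolist()
# with the example df above this returns ['b', 'c']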
I had a problem where I had too many columns to visually inspect on the screen, so a short list comprehension that filters and returns the offending columns is
nan_cols = [i for i in df.columns if df[i].isnull().any()]
if that's helpful to anyone
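With the example df from the update above, nan_cols comes out as ['a', 'b'].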
Adding to that, if you want to filter out columns having more NaN values than a threshold, say 85%, then use
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85*len(df)]
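An equivalent one-liner relies on the fact that the mean of a boolean mask is the fraction of True values, so no explicit length is needed:
nan_cols85 = df.columns[df.isnull().mean() > 0.85].tolist()  # columns more than 85% NaN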
This worked for me:
1. For getting the columns having at least 1 null value (column names):
data.columns[data.isnull().any()]
2. For getting the null count of each column having at least 1 null value:
data[data.columns[data.isnull().any()]].isnull().sum()
[Optional] 3. For getting the percentage of null values in those columns:
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
I know this is a very well-answered question but I wanted to add a slight adjustment. This one returns only the columns containing nulls, while still showing the count of the nulls.
pd.isnull(df).sum()[pd.isnull(df).sum() > 0]
Or, broken into named steps:
null_count_ser = pd.isnull(df).sum()
is_null_ser = null_count_ser > 0
null_count_ser[is_null_ser]
Sample output:
name       5
phone    187
age      644
In datasets with a large number of columns it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example in my dataframe it contained 82 columns, of which 19 contained at least one null value.
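Since notna().all() is the exact per-column complement of isna().any(), the two counts always add up to the total; a quick sanity check:
assert len(df.columns[df.isna().any()]) + len(df.columns[df.notna().all()]) == len(df.columns)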
Further, you can also automatically remove columns and rows depending on which has more null values.
Here is code which does this:
df = df.drop(df.columns[df.isna().sum() > len(df)/2], axis=1)  # drop columns where more than half the values are NaN
df = df.dropna(axis=0).reset_index(drop=True)                  # then drop any remaining rows with NaNs
Note: the above removes all of your null values. If you want to keep the null values, process them beforehand.
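The column-dropping step can also be written with dropna's thresh parameter, which keeps only columns having at least that many non-NaN values:
df = df.dropna(axis=1, thresh=len(df) // 2)  # keep columns with at least half their values present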
This is one of the methods:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan], 'c': [np.nan, 2, np.nan], 'd': [np.nan, np.nan, np.nan]})
print(pd.isnull(df).sum())
To see just the columns containing NaNs and just the rows containing NaNs:
isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)
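With the example df above, this prints columns a and b restricted to rows 0, 1 and 2, i.e. the only rows that actually contain NaNs.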
To list the columns with missing values along with the percentage missing in each:
features_with_na = [feature for feature in dataframe.columns if dataframe[feature].isnull().sum() > 0]
for feature in features_with_na:
    print(feature, np.round(dataframe[feature].isnull().mean() * 100, 4), '% missing values')
print(features_with_na)
If you want to write it as a one-liner (could be useful if functions need to be called sequentially in a pipeline), then you can do so using either pipe() or passing a callable to loc[]. pipe() can be used to get the columns with NaN values as well.
df.isna().any().pipe(lambda x: x.index[x])
df.isna().any().loc[lambda x: x].index
A working example:
df = pd.DataFrame({
'a': [1, 2, pd.NA],
'b': [10, 20, 30],
'c': [pd.NA, 'B', 'C']
})
df.isna().any().pipe(lambda x: x.index[x]) # Index(['a', 'c'], dtype='object')
df.isna().any().loc[lambda x: x].index # Index(['a', 'c'], dtype='object')
df.isna().any().pipe(lambda x: df.loc[:, x])
      a     c
0     1  <NA>
1     2     B
2  <NA>     C
If you want the opposite, i.e. columns without any NaN, then notna().all() can be used instead of isna().any().
df.notna().all().pipe(lambda x: x.index[x]) # Index(['b'], dtype='object')
df.isna().any()[lambda x: x] works for me.
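Indexing with a callable keeps only the True entries of the boolean Series, so appending .index gives the column names, just like the loc[] variant above.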