
So, I read CSV files that are generated with Excel. Those can contain empty columns and rows to the right of and below the actual data range/table. Empty here means really empty: no column header, no data whatsoever, clearly an artifact.

In a first iteration I just used

pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False) 

which seemed to work fine. But it also removes legitimately empty columns, i.e. regular columns that have a proper column name and are supposed to be empty because that is their data.
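To make the failure mode concrete, here is a minimal sketch (column names and data made up) showing that the dropna chain also drops a named but intentionally empty column:

```python
import io

import pandas as pd

# 'comment' is a real, named column that just happens to be empty;
# the trailing commas simulate Excel's artifact column with no header
csv_text = "id,comment,\n1,,\n2,,\n"
df = pd.read_csv(io.StringIO(csv_text))

cleaned = df.dropna(axis=1, how='all')
# the named-but-empty 'comment' column is gone along with the artifact
print(list(cleaned.columns))  # ['id']
```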

I do want to keep every column that has a proper column name OR contains data: someone might simply have forgotten to give a column a name, but it is still a proper column.

So, per https://stackoverflow.com/a/43983654/2215053 I first used

unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
pd.concat(
    [basedata_df2.loc[:, ~unnamed_cols_mask],
     basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all')],
    axis=1,
)

which looks and feels clean, but it scrambles the column order.
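The reordering can be seen in a small sketch (data made up): combining the named columns with the surviving unnamed ones, e.g. via pd.concat, appends the unnamed columns at the end instead of keeping them in place:

```python
import io

import pandas as pd

# the middle column has data but no header
csv_text = "a,,c\n1,2,3\n4,5,6\n"
df = pd.read_csv(io.StringIO(csv_text))

mask = df.columns.str.contains('^Unnamed')
out = pd.concat(
    [df.loc[:, ~mask], df.loc[:, mask].dropna(axis=1, how='all')],
    axis=1,
)
# 'Unnamed: 1' was originally in the middle, now it comes last
print(list(out.columns))  # ['a', 'c', 'Unnamed: 1']
```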

So now I go with:

df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns.array
         if not column_name.startswith('Unnamed: ')
         or not df[column_name].isnull().all()]]

Which works. But surely there is an obviously right way to accomplish this frequently occurring task? So how could I do this better?

Specifically: is there a way to make sure that column names starting with 'Unnamed: ' were created by pd.read_csv() and not originally imported from the CSV itself?
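For what it's worth, the names alone cannot tell you: a literal 'Unnamed: 0' header in the file looks exactly like a generated one. One possible workaround (a sketch, assuming a plain single header row with no index_col and no duplicate names) is to read the raw header row separately and check which fields were actually blank:

```python
import csv
import io

import pandas as pd

# the first header field literally says 'Unnamed: 0'; the last one is blank
csv_text = "Unnamed: 0,b,\n1,2,3\n"
df = pd.read_csv(io.StringIO(csv_text))
# the literal 'Unnamed: 0' and the generated 'Unnamed: 2' are indistinguishable here
print(list(df.columns))  # ['Unnamed: 0', 'b', 'Unnamed: 2']

# reading the raw header row reveals which fields were truly blank
raw_header = next(csv.reader(io.StringIO(csv_text)))
generated = [name for field, name in zip(raw_header, df.columns) if field == '']
print(generated)  # ['Unnamed: 2']
```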

  • I guess the line df = [[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]] should actually read df = df[[, right? But that doesn't really drop the columns in place; it creates a new selection of the existing dataframe, which behaves a bit differently (not all operations are supported on such a slice). Commented Jan 26, 2021 at 11:42

1 Answer


Unfortunately, I think there is no built-in function for this, not even in pandas.read_csv. But you can apply the following code:

# boolean Series (indexed by column name): True where a column contains only NAs
ser_all_na = df.isna().all(axis='rows')
# columns that got a generic 'Unnamed: ...' name from read_csv
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
# keep only the columns that are both unnamed and entirely NA...
del_indexer &= ser_all_na
# ...and drop exactly those, preserving the order of the remaining columns
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
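End to end, with made-up sample data, the snippet above drops only the unnamed-and-empty artifact column while keeping both the named empty column and the unnamed column that carries data:

```python
import io

import pandas as pd

# 'b' is named but empty; 'Unnamed: 2' has data; 'Unnamed: 3' is the artifact
csv_text = "a,b,,\n1,,x,\n2,,y,\n"
df = pd.read_csv(io.StringIO(csv_text))

ser_all_na = df.isna().all(axis='rows')
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
del_indexer &= ser_all_na
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)

# only the truly empty, unnamed 'Unnamed: 3' was removed, order intact
print(list(df.columns))  # ['a', 'b', 'Unnamed: 2']
```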

1 Comment

Thanks a lot! It works and I think I even understand how :)
