
So, I read CSV files that are generated with Excel. Those can contain empty columns and rows to the right of and below the actual data range/table. Empty here means really empty: no column header, no data whatsoever, clearly an artifact.

In a first iteration I just used

pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False) 

which seemed to work fine. But it also removes legitimately empty columns, i.e. regular columns that have a proper column name and are supposed to be empty because that is their data.
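To make the failure mode concrete, here is a minimal sketch (column names and data made up) showing that the dropna chain also drops a named but intentionally empty column:

```python
import io

import pandas as pd

# 'comment' is a real, named column that just happens to be empty;
# the trailing commas simulate Excel's artifact column with no header
csv_text = "id,comment,\n1,,\n2,,\n"
df = pd.read_csv(io.StringIO(csv_text))

cleaned = df.dropna(axis=1, how='all')
# the named-but-empty 'comment' column is gone along with the artifact
print(list(cleaned.columns))  # ['id']
```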

I do want to keep every column that has a proper column name OR contains data: someone might simply have forgotten to give a column a name, but it is still a proper column.

So, per https://stackoverflow.com/a/43983654/2215053 I first used

unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
pd.concat(
    [basedata_df2.loc[:, ~unnamed_cols_mask],
     basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all')],
    axis=1,
)

which looks and feels clean, but it scrambles the column order.
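The reordering can be seen in a small sketch (data made up): combining the named columns with the surviving unnamed ones, e.g. via pd.concat, appends the unnamed columns at the end instead of keeping them in place:

```python
import io

import pandas as pd

# the middle column has data but no header
csv_text = "a,,c\n1,2,3\n4,5,6\n"
df = pd.read_csv(io.StringIO(csv_text))

mask = df.columns.str.contains('^Unnamed')
out = pd.concat(
    [df.loc[:, ~mask], df.loc[:, mask].dropna(axis=1, how='all')],
    axis=1,
)
# 'Unnamed: 1' was originally in the middle, now it comes last
print(list(out.columns))  # ['a', 'c', 'Unnamed: 1']
```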

So now I go with:

df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns.array
         if not column_name.startswith('Unnamed: ')
         or not df[column_name].isnull().all()]]

Which works. But surely there is an obviously right way to accomplish this frequently occurring task? So how could I do this better?

Specifically: is there a way to make sure that column names starting with 'Unnamed: ' were created by pd.read_csv() and not originally imported from the CSV itself?
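For what it's worth, the names alone cannot tell you: a literal 'Unnamed: 0' header in the file looks exactly like a generated one. One possible workaround (a sketch, assuming a plain single header row with no index_col and no duplicate names) is to read the raw header row separately and check which fields were actually blank:

```python
import csv
import io

import pandas as pd

# the first header field literally says 'Unnamed: 0'; the last one is blank
csv_text = "Unnamed: 0,b,\n1,2,3\n"
df = pd.read_csv(io.StringIO(csv_text))
# the literal 'Unnamed: 0' and the generated 'Unnamed: 2' are indistinguishable here
print(list(df.columns))  # ['Unnamed: 0', 'b', 'Unnamed: 2']

# reading the raw header row reveals which fields were truly blank
raw_header = next(csv.reader(io.StringIO(csv_text)))
generated = [name for field, name in zip(raw_header, df.columns) if field == '']
print(generated)  # ['Unnamed: 2']
```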

  • I guess the line df = [[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]] should actually read df = df[[, right? But that doesn't really drop the columns in place; it creates a new selection of the existing dataframe, which behaves a bit differently (not all operations are supported on such a slice). Commented Jan 26, 2021 at 11:42

1 Answer


Unfortunately, I think there is no built-in function for this, not even in pandas.read_csv. But you can apply the following code:

# boolean Series (indexed by column name): True where a column contains only NAs
ser_all_na = df.isna().all(axis='rows')
# columns that got a generic 'Unnamed: ...' name from read_csv
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
# keep only the columns that are both unnamed and entirely NA...
del_indexer &= ser_all_na
# ...and drop exactly those, preserving the order of the remaining columns
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
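End to end, with made-up sample data, the snippet above drops only the unnamed-and-empty artifact column while keeping both the named empty column and the unnamed column that carries data:

```python
import io

import pandas as pd

# 'b' is named but empty; 'Unnamed: 2' has data; 'Unnamed: 3' is the artifact
csv_text = "a,b,,\n1,,x,\n2,,y,\n"
df = pd.read_csv(io.StringIO(csv_text))

ser_all_na = df.isna().all(axis='rows')
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
del_indexer &= ser_all_na
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)

# only the truly empty, unnamed 'Unnamed: 3' was removed, order intact
print(list(df.columns))  # ['a', 'b', 'Unnamed: 2']
```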

1 Comment

Thanks a lot! It works and I think I even understand how :)
