Build a list of pairs of columns that contain identical values:
>>> from itertools import combinations
>>> _dup_ohe_col_pairs = [(i, j) for i,j in combinations(df_dups, 2) if df_dups[i].equals(df_dups[j])]
>>> _dup_ohe_col_pairs = sorted(_dup_ohe_col_pairs, key=lambda x: x[0])
Don't just collect the column names appearing in _dup_ohe_col_pairs and pass them to pandas.DataFrame.drop. If columns a, b and c all have the same values, the list will contain ('a', 'b'), ('a', 'c') and ('b', 'c'), so every column of the group appears in some pair and you can end up dropping all of them. With groups of 3, 4 or 5 identical columns, picking which single column to retain from these overlapping pairs gets very difficult.
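To see the problem concretely, here is a minimal sketch (column names a/b/c are made up for illustration) showing that with three identical columns, every column name occurs in some pair:

```python
import pandas as pd
from itertools import combinations

# three identical columns (hypothetical names a, b, c)
df = pd.DataFrame({'a': [1, 0, 1], 'b': [1, 0, 1], 'c': [1, 0, 1]})
pairs = [(i, j) for i, j in combinations(df, 2) if df[i].equals(df[j])]
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# every column appears somewhere in the pairs, so dropping all names
# that occur in any pair would delete the whole group
names_in_pairs = {name for pair in pairs for name in pair}
print(names_in_pairs == set(df.columns))  # True
```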
Here's how you do that:
# from: https://stackoverflow.com/questions/75257052/getting-unique-values-and-their-following-pairs-from-list-of-tuples/75257487#75257487
def get_unique_to_duplicates_map(data):
    # ecs stands for equivalence classes (https://en.wikipedia.org/wiki/Equivalence_class)
    ecs = []
    for a, b in data:
        a_ec = next((ec for ec in ecs if a in ec), None)
        b_ec = next((ec for ec in ecs if b in ec), None)
        if a_ec:
            if b_ec:
                # Found equivalence classes for both elements, everything is okay
                if a_ec is not b_ec:
                    # We only need one of them though
                    ecs.remove(b_ec)
                    a_ec.update(b_ec)
            else:
                # Add the new element to the found equivalence class
                a_ec.add(b)
        else:
            if b_ec:
                # Add the new element to the found equivalence class
                b_ec.add(a)
            else:
                # First time we see either of these: make a new equivalence class
                ecs.append({a, b})
    # Extract a representative element from each class and construct a dictionary
    out = {ec.pop(): ec for ec in ecs}
    return out
>>> _unique_to_dups_map = get_unique_to_duplicates_map(data=_dup_ohe_col_pairs)
>>> _unique_to_dups_map
>>> dropped_to_retained_dict = {v_i:k for k,v in _unique_to_dups_map.items() for v_i in v}
>>> dropped_to_retained_dict = {k:v for k, v in sorted(dropped_to_retained_dict.items(), key=lambda item:item[1])}
>>> dropped_to_retained_dict
>>> df_dups.drop(columns=dropped_to_retained_dict.keys(), inplace=True)
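For instance, with a hypothetical mapping where b and c duplicate a, and e duplicates d, the inversion and final drop look like this (column names and data are made up):

```python
import pandas as pd

# suppose get_unique_to_duplicates_map returned this (hypothetical columns):
_unique_to_dups_map = {'a': {'b', 'c'}, 'd': {'e'}}

# invert it: each duplicate column maps to the one representative we retain
dropped_to_retained = {dup: keep
                       for keep, dups in _unique_to_dups_map.items()
                       for dup in dups}

df = pd.DataFrame({c: [1, 2] for c in ['a', 'b', 'c', 'd', 'e']})
df = df.drop(columns=dropped_to_retained.keys())
print(list(df.columns))  # ['a', 'd']
```

Exactly one representative of each group of duplicates survives.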
Solution when columns have the same values but are encoded differently:
It may happen that two columns carry essentially the same values, but are encoded differently. For example:
   b  c  d  e  f
1  1  3  4  1  a
2  3  4  5  2  c
3  2  5  6  3  b
4  3  4  5  2  c
5  4  5  6  3  d
6  2  4  5  2  b
7  4  5  6  3  d
In the above example, you can see that column f, after label encoding, will have the same values as column b. So how do you catch duplicate columns like these? Here you go:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm versions

# create an empty dataframe with the same index as your dataframe (let's call it
# train_df), which will be filled with a factorized version of the original data
train_enc = pd.DataFrame(index=train_df.index)
# now encode all the features
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]

# find and print duplicated columns
dup_cols = {}
# start with one feature
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # compare it against all the remaining features
    for c2 in train_enc.columns[i + 1:]:
        # record c2 if its codes match the column from the outer loop
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

# dup_cols maps each duplicate column name to the column it is identical to
print(dup_cols)
The names of columns that, once encoded, match another column are printed to stdout.
If you want to drop the duplicate columns, you can do:
train_df.drop(columns=dup_cols.keys(), inplace=True)
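As a quick self-contained check of the factorize trick on the toy table above (only columns b and f are reproduced here):

```python
import pandas as pd

# the toy frame from the text: f is column b re-encoded as letters
df = pd.DataFrame({'b': [1, 3, 2, 3, 4, 2, 4],
                   'f': ['a', 'c', 'b', 'c', 'd', 'b', 'd']})

# factorize every column and compare the resulting integer codes
enc = df.apply(lambda s: s.factorize()[0])
print(enc['b'].equals(enc['f']))  # True
```

Even though b holds integers and f holds strings, both factorize to the codes [0, 1, 2, 1, 3, 2, 3], so the comparison catches the duplication.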
Note that a naive pairwise comparison such as {(col_1, col_2) for col_1 in df.columns for col_2 in df.columns if col_1 != col_2 and df[col_1].equals(df[col_2])} is O(n^2) in the number of columns.
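One way to cut down the quadratic scan (a sketch not from the original text, and the helper name is made up): hash each column's values first and only run the expensive element-wise comparison inside a hash bucket, verifying matches to guard against collisions.

```python
import pandas as pd
from collections import defaultdict

def duplicate_column_groups(df):
    """Group duplicate columns, comparing only columns whose value hashes match."""
    buckets = defaultdict(list)
    for col in df.columns:
        # order-sensitive comparison still happens below; the hash sum
        # is just a cheap pre-filter
        key = pd.util.hash_pandas_object(df[col], index=False).sum()
        buckets[key].append(col)
    groups = []
    for cols in buckets.values():
        # verify within a bucket to guard against hash collisions
        while cols:
            head, rest = cols[0], cols[1:]
            group = [head] + [c for c in rest if df[head].equals(df[c])]
            groups.append(group)
            cols = [c for c in rest if c not in group]
    return groups

df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]})
print(duplicate_column_groups(df))  # [['a', 'b'], ['c']]
```

In the common case where most columns differ, each bucket holds one column and no full comparisons are needed at all.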