Build a list of pairs of columns that contain identical values:
>>> from itertools import combinations
>>> _dup_ohe_col_pairs = [(i, j) for i,j in combinations(df_dups, 2) if df_dups[i].equals(df_dups[j])]
>>> _dup_ohe_col_pairs = sorted(_dup_ohe_col_pairs, key=lambda x: x[0])
Don't just collect the column names appearing in _dup_ohe_col_pairs and pass them to pandas.DataFrame.drop. If columns a, b and c all have the same values, the list will contain ('a', 'b'), ('a', 'c') and ('b', 'c'), so every column of the group appears in some pair and you can end up dropping all of them. With groups of 3, 4 or 5 identical columns, picking which single column to retain from these overlapping pairs gets very difficult.
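To see the problem concretely, here is a minimal sketch (column names a/b/c are made up for illustration) showing that with three identical columns, every column name occurs in some pair:

```python
import pandas as pd
from itertools import combinations

# three identical columns (hypothetical names a, b, c)
df = pd.DataFrame({'a': [1, 0, 1], 'b': [1, 0, 1], 'c': [1, 0, 1]})
pairs = [(i, j) for i, j in combinations(df, 2) if df[i].equals(df[j])]
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# every column appears somewhere in the pairs, so dropping all names
# that occur in any pair would delete the whole group
names_in_pairs = {name for pair in pairs for name in pair}
print(names_in_pairs == set(df.columns))  # True
```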
Here's how you do that:
# from: https://stackoverflow.com/questions/75257052/getting-unique-values-and-their-following-pairs-from-list-of-tuples/75257487#75257487
def get_unique_to_duplicates_map(data):
    # ecs stands for equivalence classes (https://en.wikipedia.org/wiki/Equivalence_class)
    ecs = []
    for a, b in data:
        a_ec = next((ec for ec in ecs if a in ec), None)
        b_ec = next((ec for ec in ecs if b in ec), None)
        if a_ec:
            if b_ec:
                # Found equivalence classes for both elements, everything is okay
                if a_ec is not b_ec:
                    # We only need one of them though
                    ecs.remove(b_ec)
                    a_ec.update(b_ec)
            else:
                # Add the new element to the found equivalence class
                a_ec.add(b)
        else:
            if b_ec:
                # Add the new element to the found equivalence class
                b_ec.add(a)
            else:
                # First time we see either of these: make a new equivalence class
                ecs.append({a, b})
    # Extract a representative element from each class and construct a dictionary
    out = {ec.pop(): ec for ec in ecs}
    return out
>>> _unique_to_dups_map = get_unique_to_duplicates_map(data=_dup_ohe_col_pairs)
>>> _unique_to_dups_map
>>> dropped_to_retained_dict = {v_i:k for k,v in _unique_to_dups_map.items() for v_i in v}
>>> dropped_to_retained_dict = {k:v for k, v in sorted(dropped_to_retained_dict.items(), key=lambda item:item[1])}
>>> dropped_to_retained_dict
>>> df_dups.drop(columns=dropped_to_retained_dict.keys(), inplace=True)
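For instance, with a hypothetical mapping where b and c duplicate a, and e duplicates d, the inversion and final drop look like this (column names and data are made up):

```python
import pandas as pd

# suppose get_unique_to_duplicates_map returned this (hypothetical columns):
_unique_to_dups_map = {'a': {'b', 'c'}, 'd': {'e'}}

# invert it: each duplicate column maps to the one representative we retain
dropped_to_retained = {dup: keep
                       for keep, dups in _unique_to_dups_map.items()
                       for dup in dups}

df = pd.DataFrame({c: [1, 2] for c in ['a', 'b', 'c', 'd', 'e']})
df = df.drop(columns=dropped_to_retained.keys())
print(list(df.columns))  # ['a', 'd']
```

Exactly one representative of each group of duplicates survives.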
Solution when columns have the same values but are encoded differently:
It may happen that two columns carry essentially the same values, but are encoded differently. For example:
   b  c  d  e  f
1  1  3  4  1  a
2  3  4  5  2  c
3  2  5  6  3  b
4  3  4  5  2  c
5  4  5  6  3  d
6  2  4  5  2  b
7  4  5  6  3  d
In the above example, you can see that column f, after label encoding, will have the same values as column b. So how do you catch duplicate columns like these? Here you go:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm versions

# create an empty dataframe with the same index as your dataframe (let's call it
# train_df), which will be filled with a factorized version of the original data
train_enc = pd.DataFrame(index=train_df.index)
# now encode all the features
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]

# find and print duplicated columns
dup_cols = {}
# start with one feature
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # compare it against all the remaining features
    for c2 in train_enc.columns[i + 1:]:
        # record c2 if its codes match the column from the outer loop
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

# dup_cols maps each duplicate column name to the column it is identical to
print(dup_cols)
The names of columns that, once encoded, match another column are printed to stdout.
If you want to drop the duplicate columns, you can do:
train_df.drop(columns=dup_cols.keys(), inplace=True)
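As a quick self-contained check of the factorize trick on the toy table above (only columns b and f are reproduced here):

```python
import pandas as pd

# the toy frame from the text: f is column b re-encoded as letters
df = pd.DataFrame({'b': [1, 3, 2, 3, 4, 2, 4],
                   'f': ['a', 'c', 'b', 'c', 'd', 'b', 'd']})

# factorize every column and compare the resulting integer codes
enc = df.apply(lambda s: s.factorize()[0])
print(enc['b'].equals(enc['f']))  # True
```

Even though b holds integers and f holds strings, both factorize to the codes [0, 1, 2, 1, 3, 2, 3], so the comparison catches the duplication.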
Note that a naive pairwise comparison such as {(col_1, col_2) for col_1 in df.columns for col_2 in df.columns if col_1 != col_2 and df[col_1].equals(df[col_2])} is O(n^2) in the number of columns.
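One way to cut down the quadratic scan (a sketch not from the original text, and the helper name is made up): hash each column's values first and only run the expensive element-wise comparison inside a hash bucket, verifying matches to guard against collisions.

```python
import pandas as pd
from collections import defaultdict

def duplicate_column_groups(df):
    """Group duplicate columns, comparing only columns whose value hashes match."""
    buckets = defaultdict(list)
    for col in df.columns:
        # order-sensitive comparison still happens below; the hash sum
        # is just a cheap pre-filter
        key = pd.util.hash_pandas_object(df[col], index=False).sum()
        buckets[key].append(col)
    groups = []
    for cols in buckets.values():
        # verify within a bucket to guard against hash collisions
        while cols:
            head, rest = cols[0], cols[1:]
            group = [head] + [c for c in rest if df[head].equals(df[c])]
            groups.append(group)
            cols = [c for c in rest if c not in group]
    return groups

df = pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]})
print(duplicate_column_groups(df))  # [['a', 'b'], ['c']]
```

In the common case where most columns differ, each bucket holds one column and no full comparisons are needed at all.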