4

I am trying to find the columns in a data frame that hold the same values as another column. R has whichAreInDouble for this, and I am trying to implement the same thing in Python.

df  =   
a b c d e f g h i   
1 2 3 4 1 2 3 4 5  
2 3 4 5 2 3 4 5 6  
3 4 5 6 3 4 5 6 7

It should give me the list of columns with the same values, like:

a, e are equal
b, f are equal
c, g are equal
2
  • {(col_1, col_2) for col_1 in df.columns for col_2 in df.columns if col_1 != col_2 and df[col_1].equals(df[col_2])} This is O(n^2) in the number of columns. Commented Sep 18, 2019 at 17:19
  • What if there are thousands of columns? I am working on a huge dataset with 2000 columns. My idea is to compare the first 10 rows of two columns and, if they match, compare the next 10 rows; if they don't match, move on to the next column. Commented Sep 18, 2019 at 17:24

3 Answers

5

Let's try using combinations from itertools:

from itertools import combinations

[(i, j) for i,j in combinations(df, 2) if df[i].equals(df[j])]

Output:

[('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', 'h')]
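
With thousands of columns, most pairs differ within the first few rows, so the idea raised in the comments (check a few rows first, then compare fully) can cut the work considerably. Here is a minimal sketch, assuming column names are unique and cell values are hashable. The function name equal_column_pairs and the parameter n_probe are illustrative and do not come from any library:

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd


def equal_column_pairs(df, n_probe=10):
    """Return pairs of identical columns, bucketing on the first n_probe rows."""
    buckets = defaultdict(list)
    for col in df.columns:
        # Two columns can only be equal if their first n_probe values agree,
        # so those values serve as a cheap bucket key.
        head = df[col].head(n_probe).tolist()
        # Map missing values to None so that NaN-containing heads still match.
        key = tuple(None if pd.isna(v) else v for v in head)
        buckets[key].append(col)

    pairs = []
    for cols in buckets.values():
        # The full comparison only runs inside a bucket.
        pairs.extend(
            (c1, c2) for c1, c2 in combinations(cols, 2) if df[c1].equals(df[c2])
        )
    return pairs


df = pd.DataFrame(
    [[1, 2, 3, 4, 1, 2, 3, 4, 5],
     [2, 3, 4, 5, 2, 3, 4, 5, 6],
     [3, 4, 5, 6, 3, 4, 5, 6, 7]],
    columns=list("abcdefghi"),
)
print(equal_column_pairs(df))
# [('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', 'h')]
```

The worst case is still quadratic (for example, when every column shares the same first rows), but columns that differ early never reach the full .equals comparison.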

2

Make a list of the pairs of columns that have the same values.

>>> from itertools import combinations
>>> _dup_ohe_col_pairs = [(i, j) for i,j in combinations(df_dups, 2) if df_dups[i].equals(df_dups[j])]
>>> _dup_ohe_col_pairs = sorted(_dup_ohe_col_pairs, key=lambda x: x[0])

Don't just pass every column named in _dup_ohe_col_pairs to pandas.DataFrame.drop. If a and b have the same values, the list contains ('a', 'b'), so dropping every listed column removes both of them and no copy survives. Now imagine what happens when 3, 4 or 5 columns are identical: choosing what to retain from those pairs gets very difficult.

Here's how you do that:

# from: https://stackoverflow.com/questions/75257052/getting-unique-values-and-their-following-pairs-from-list-of-tuples/75257487#75257487
def get_unique_to_duplicates_map(data):
    # ecs stands for equivalent classes (https://en.wikipedia.org/wiki/Equivalence_class)
    ecs = []
    
    for a, b in data:
        a_ec = next((ec for ec in ecs if a in ec), None)
        b_ec = next((ec for ec in ecs if b in ec), None)
        if a_ec:
            if b_ec:
                # Found equivalence classes for both elements, everything is okay
                if a_ec is not b_ec:
                    # We only need one of them though
                    ecs.remove(b_ec)
                    a_ec.update(b_ec)
            else:
                # Add the new element to the found equivalence class       
                a_ec.add(b)
        else:              
            if b_ec:
                # Add the new element to the found equivalence class
                b_ec.add(a)
            else:                                                   
                # First time we see either of these: make a new equivalence class 
                ecs.append({a, b})

    # Extract a representative element and construct a dictionary
    out = {
        ec.pop(): ec
        for ec in ecs
    }

    # return it
    return out

>>> _unique_to_dups_map = get_unique_to_duplicates_map(data=_dup_ohe_col_pairs)
>>> _unique_to_dups_map
>>> dropped_to_retained_dict = {v_i: k for k, v in _unique_to_dups_map.items() for v_i in v}
>>> dropped_to_retained_dict = dict(sorted(dropped_to_retained_dict.items(), key=lambda item: item[1]))
>>> dropped_to_retained_dict
>>> df_dups.drop(columns=list(dropped_to_retained_dict), inplace=True)
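
If the goal is only to keep the first column of each group of identical columns, a shorter route is pandas' own DataFrame.duplicated applied to the transposed frame. This is a different technique from the function above, and transposing a wide frame with mixed dtypes can be slow. A sketch on a made-up frame with three identical columns:

```python
import pandas as pd

# Hypothetical data: a, b and c are identical, d is different.
df_dups = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3], "d": [4, 5, 6]}
)

# After transposing, each original column is a row, so duplicated() flags
# every column that repeats an earlier one (keep="first" is the default).
is_dup = df_dups.T.duplicated().to_numpy()

# Keep only the columns that were not flagged.
deduped = df_dups.loc[:, ~is_dup]
print(list(deduped.columns))
# ['a', 'd']
```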

Solution when columns have the same values but are encoded differently:

Two columns may carry the same information but be encoded differently. For example:

  b c d e f
1 1 3 4 1 a
2 3 4 5 2 c 
3 2 5 6 3 b
4 3 4 5 2 c  
5 4 5 6 3 d
6 2 4 5 2 b  
7 4 5 6 3 d

In the example above, column f, after label encoding, has the same values as column b. So how do you catch duplicate columns like these? Here you go:

import numpy as np
import pandas as pd
from tqdm.auto import tqdm  # progress bars; works in notebooks and terminals

# Create an empty dataframe with the same index as your dataframe (call it train_df);
# it will hold the factorized version of the original data.
train_enc = pd.DataFrame(index=train_df.index)
# Encode every feature as integer codes, numbered in order of first appearance.
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]
# Maps each duplicate column to the earlier column it matches.
dup_cols = {}
# Start with one feature...
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # ...and compare it with all the remaining features.
    for c2 in train_enc.columns[i + 1:]:
        # Record c2 if it is not already recorded and its codes match those of c1.
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
# Keys are the duplicate columns; values are the columns they duplicate.
print(dup_cols)

The names of columns that match another column once encoded are printed to stdout.

If you want to drop the duplicate columns, you can do:

train_df.drop(columns=list(dup_cols), inplace=True)
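
To see why factorizing exposes such duplicates, here is a small sketch that rebuilds the example table from above (the name example_df is illustrative) and prints the integer codes for each column:

```python
import pandas as pd

# The example table from this answer, rebuilt by hand.
example_df = pd.DataFrame(
    {
        "b": [1, 3, 2, 3, 4, 2, 4],
        "c": [3, 4, 5, 4, 5, 4, 5],
        "d": [4, 5, 6, 5, 6, 5, 6],
        "e": [1, 2, 3, 2, 3, 2, 3],
        "f": ["a", "c", "b", "c", "d", "b", "d"],
    },
    index=range(1, 8),
)

# pd.factorize numbers the values in order of first appearance, so columns
# with the same repetition pattern get the same codes whatever their labels.
codes = {
    col: pd.factorize(example_df[col])[0].tolist() for col in example_df.columns
}
print(codes["b"])  # [0, 1, 2, 1, 3, 2, 3]
print(codes["f"])  # [0, 1, 2, 1, 3, 2, 3]
print(codes["c"] == codes["d"] == codes["e"])  # True
```

Note that in this table c, d and e also collapse to the same codes, so the loop above would report them as duplicates of each other as well as f being a duplicate of b.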


1

from itertools import combinations

# chk is the dataframe to deduplicate
cols_to_remove = []
for i, j in combinations(chk, 2):
    if chk[i].equals(chk[j]):
        # Drop the later column of each identical pair
        cols_to_remove.append(j)

chk = chk.drop(cols_to_remove, axis=1)
