I have a Python list called added that contains 156 sublists, each holding two column references and an array. For example:

[0, 1, array]

The problem is that I have duplicates, although they are not exact duplicates because the column references are flipped. The following two entries hold exactly the same array:

[[0, 1, array], [1, 0, array]]

My plan was to sort the column references, check whether any pairs were the same, and append only the new ones to a new list. I tried two things, and each failed in a different way. First, the sort:

for a in range(len(added)):
    added[a][0:2] = added[a][0:2].sort()

TypeError: can only assign an iterable

I also tried checking whether the array was already in an initially empty Python list, no_dups, and appending the column references and array if it wasn't:

no_dups = []
for a in range(len(added)):
    if added[a][2] in no_dups:
        print('already appended')
    else:
        no_dups.append(added[a])

<input>:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.

Neither worked. I'm struggling to get my head round how to remove duplicates here.

Thanks.

EDIT: reproducible code:

import numpy as np
import pandas as pd
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
data = datasets.load_boston()

df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()


cols = []
added = []
for column in X.T:
    cols.append(column)
for i in range(len(cols)):
    for x in range(len(cols)):
        same_check = cols[i] == cols[x]
        if same_check.all() == True:
            continue
        else:
            added.append([i, x, cols[i] * cols[x]])

This code should reproduce the entire added list.

  • Could you provide some example data? A few (<10) lines from your added array would help. Commented May 13, 2020 at 13:46
  • @PaddyHarrison Please see edit in question Commented May 13, 2020 at 14:02
  • That's great, I've edited my answer accordingly. Commented May 13, 2020 at 14:10

3 Answers

Your first error occurs because list.sort() sorts in place and returns None, so there is nothing iterable to assign to the slice. A workaround is to use sorted(), which returns a new list:

for a in range(len(added)):
    added[a][:2] = sorted(added[a][:2])

You can then get the index of the first occurrence of each unique pair with np.unique:

unique, idx = np.unique([a[:2] for a in added], axis=0, return_index=True)

no_dups = [added[i] for i in idx]

>>> len(added)
156

>>> len(no_dups)
78
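
For anyone who wants to try the idea without the Boston data, here is a minimal sketch on a toy list. The three-entry toy list and the helper name drop_flipped_pairs are made up for illustration; only sorted() and np.unique(..., axis=0, return_index=True) come from the answer above. Unlike the loop above, this version leaves the original list untouched.

```python
import numpy as np

def drop_flipped_pairs(added):
    """Return the entries of `added` whose (i, j) pair is unique up to order."""
    # Normalise each pair so that [1, 0] and [0, 1] look identical.
    pairs = [sorted(entry[:2]) for entry in added]
    # return_index gives the position of the first occurrence of each unique row.
    _, idx = np.unique(pairs, axis=0, return_index=True)
    # Sort the indices so the survivors keep their original relative order.
    return [added[i] for i in sorted(idx)]

# Hypothetical toy data: the first two entries are flipped duplicates.
toy = [
    [0, 1, np.array([1.0, 2.0])],
    [1, 0, np.array([1.0, 2.0])],
    [0, 2, np.array([3.0, 4.0])],
]
result = drop_flipped_pairs(toy)
print(len(result))
```

Sorting idx is optional; np.unique returns the indices in the order of the sorted unique rows, which is not necessarily the order in which the entries appeared.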

You can convert the entire added list into a NumPy array, slice out the two index columns and sort them, and then use np.unique to get the unique rows.

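import numpy as np

# Dummy `added` in the form [[a, b, array], [a, b, array], ...]
added = [np.random.choice(5, 2).tolist() + [np.random.randint(10, size=(1, 5))] for i in range(156)]

# Convert to NumPy. dtype=object is required on NumPy >= 1.24, where ragged
# input (two ints plus an array per row) no longer converts implicitly.
added_np = np.array(added, dtype=object)
# Sort each index pair, then find the first occurrence of each unique pair
pairs = np.sort(added_np[:, :2].astype(int), axis=1)
vals, idxs = np.unique(pairs, axis=0, return_index=True)
added_no_duplicate = added_np[idxs].tolist()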
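
If converting the whole list to a NumPy array feels heavy, a plain-Python alternative is a dictionary keyed on the order-independent pair. This is a different technique from the np.unique approach above, sketched under the assumption that each entry has exactly the form [i, j, array]; the helper name dedupe_by_pair and the toy data are hypothetical.

```python
import numpy as np

def dedupe_by_pair(added):
    """Keep the first entry seen for each unordered (i, j) pair."""
    seen = {}
    for i, j, arr in added:
        # (min, max) is the same key for [0, 1] and [1, 0].
        key = (min(i, j), max(i, j))
        if key not in seen:
            seen[key] = [key[0], key[1], arr]
    # Dicts preserve insertion order (Python 3.7+), so first occurrences stay in order.
    return list(seen.values())

# Hypothetical toy data: the first two entries are flipped duplicates.
toy = [
    [1, 0, np.array([1.0, 2.0])],
    [0, 1, np.array([1.0, 2.0])],
    [2, 0, np.array([3.0, 4.0])],
]
no_dups = dedupe_by_pair(toy)
print([entry[:2] for entry in no_dups])  # [[0, 1], [0, 2]]
```

The arrays are never compared here, so the ambiguous-truth-value problem with NumPy equality does not come up.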

  • As for TypeError: can only assign an iterable:

added[a][0:2].sort() sorts in place and returns None, so there is nothing to assign to the slice. To get the sorted list back, use the built-in function sorted(), which returns a new sorted list:

added[a][0:2] = sorted(added[a][0:2])
  • As for <input>:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.:

This is a warning, not an error. Nonetheless, the check will not work for you: as the warning states, the elementwise comparison failed, so == is not well defined between your array and the elements of no_dups. When you test added[a][2] in no_dups, Python cannot meaningfully compare added[a][2] with each element of no_dups. If the third element is a NumPy array, you can compare explicitly:

for a in range(len(added)):
    added[a][0:2] = sorted(added[a][0:2])
no_dups = []
for a in added:
    add_flag = True
    for b in no_dups:
        # Compare the index pairs as lists and the arrays elementwise with .all()
        if (a[0:2]==b[0:2]) and ((a[2]==b[2]).all()):
            print('already appended')
            add_flag = False
            break
    if add_flag:
        no_dups.append(a)

len(no_dups):  78
len(added):   156

However, if all your arrays have the same length, you should stack them into a single NumPy array instead, which is significantly faster.
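
The stacking idea is not spelled out above, so the following is only one possible reading of it: stack the arrays into a 2-D array and let np.unique find the distinct rows. It assumes every array is 1-D with the same length and that duplicates are exactly equal element by element; the helper name dedupe_by_array and the toy data are hypothetical.

```python
import numpy as np

def dedupe_by_array(added):
    """Keep the first entry for each distinct array (third element)."""
    # One row per entry; requires all arrays to share the same shape.
    stacked = np.stack([entry[2] for entry in added])
    # Row-wise uniqueness, with the index of each row's first occurrence.
    _, idx = np.unique(stacked, axis=0, return_index=True)
    return [added[i] for i in sorted(idx)]

# Hypothetical toy data: the first two entries carry the same array.
toy = [
    [0, 1, np.array([1.0, 2.0])],
    [1, 0, np.array([1.0, 2.0])],
    [0, 2, np.array([3.0, 4.0])],
]
unique_entries = dedupe_by_array(toy)
print(len(unique_entries))
```

Note that this deduplicates on the array contents alone, so two different column pairs that happen to produce identical arrays would also be merged.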

4 Comments

I receive the following error when using the for loop answer:
AttributeError: 'bool' object has no attribute 'all'
If you wish to see the entire data please see edit in post
@geds133 I understood the question a bit differently. The error AttributeError: 'bool' object has no attribute 'all' is thrown because comparing two lists with == returns a plain bool, which has no .all() method. I have updated the answer in case you are interested in more detail; however, I prefer the accepted answer.
