Finding duplicates in python list containing arrays

Question

I have a Python list called added that contains 156 individual lists, each holding two column references and an array. An example is as follows:

[0, 1, array]

The problem is that I have duplicates, although they are not exact matches because the column references are flipped. The following two entries should be treated as the same:

[[0, 1, array], [1, 0, array]]

My first attempt was to sort the column references so that flipped pairs become identical, then check for matches and append only the unseen entries to a new list. This and a second attempt (further down) each produced a different error:

for a in range(len(added)):
    added[a][0:2] = added[a][0:2].sort()

TypeError: can only assign an iterable

I also tried checking whether the array was already in an initially empty Python list, no_dups, and appending the column references and array if it wasn't:

no_dups = []
for a in range(len(added)):
    if added[a][2] in no_dups:
        print('already appended')
    else:
        no_dups.append(added[a])

<input>:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.

Neither worked. I'm struggling to get my head round how to remove duplicates here.

Thanks.

EDIT: reproducible code:

import numpy as np
import pandas as pd
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this line needs an older version
data = datasets.load_boston()

df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()


cols = []
added = []
for column in X.T:
    cols.append(column)
for i in range(len(cols)):
    for x in range(len(cols)):
        same_check = cols[i] == cols[x]
        if same_check.all() == True:
            continue
        else:
            added.append([i, x, cols[i] * cols[x]])

This code should give you access to the entire created added list.

Could you provide some example data? A few (<10) lines from your added array would help. — Paddy Harrison, Commented May 13, 2020 at 13:46

Paddy Harrison · Accepted Answer · 2020-05-13 14:09:46Z

1

Your first error occurs because list.sort() sorts in place and returns None, so there is nothing to assign. A workaround is to use sorted(), which returns a new list:

for a in range(len(added)):
    added[a][:2] = sorted(added[a][:2])
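The difference between the two calls can be seen in a minimal sketch. The names pair and row are placeholders for illustration, not part of the original data. Note that a slice such as added[a][0:2] is a copy, so calling .sort() on it would not reorder the original list either.

```python
pair = [1, 0]

# list.sort() reorders the list in place and returns None
result = pair[:].sort()   # sorts a throwaway copy of the list
print(result)             # None

# sorted() leaves its input untouched and returns a new list,
# which can be assigned back into the slice
row = [1, 0, "array placeholder"]
row[:2] = sorted(row[:2])
print(row)                # [0, 1, 'array placeholder']
```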

You can then get unique indices as:

unique, idx = np.unique([a[:2] for a in added], axis=0, return_index=True)

no_dups = [added[i] for i in idx]

>>> len(added)
156

>>> len(no_dups)
78
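To see what np.unique does with return_index=True, here is a small sketch on made-up data. The list toy is a hypothetical stand-in for added. One detail worth knowing: the returned indices follow the sorted order of the unique rows, so sorting them (an optional extra step, not part of the answer above) keeps the surviving entries in their original order.

```python
import numpy as np

# Hypothetical toy data standing in for `added`: [col_i, col_j, array]
toy = [
    [0, 1, np.array([1.0, 2.0])],
    [0, 2, np.array([3.0, 4.0])],
    [1, 0, np.array([1.0, 2.0])],  # same pair as the first row, flipped
    [2, 0, np.array([3.0, 4.0])],  # same pair as the second row, flipped
]

# Normalise each pair so that (1, 0) and (0, 1) look identical
pairs = np.array([sorted(row[:2]) for row in toy])

# return_index gives the position of the first occurrence of each unique row
unique_pairs, idx = np.unique(pairs, axis=0, return_index=True)

# Sorting idx keeps the survivors in their original order
toy_no_dups = [toy[i] for i in np.sort(idx)]
print([row[:2] for row in toy_no_dups])
```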

edited May 13, 2020 at 14:09

answered May 13, 2020 at 13:51


Mercury · Accepted Answer · 2020-05-13 13:54:27Z

0

You can convert the entire added into a numpy array, then slice the indices and sort them, and then use np.unique to get unique rows.

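import numpy as np

# Dummy `added` in the form [[a, b, array], [a, b, array], ...]
added = [np.random.choice(5, 2).tolist() + [np.random.randint(10, size=(1, 5))] for i in range(156)]

# Convert to a NumPy object array; recent NumPy versions need dtype=object
# here because each row mixes two ints with an array
added_np = np.array(added, dtype=object)
# Sort each index pair, then keep the first occurrence of every unique pair
vals, idxs = np.unique(np.sort(added_np[:, :2], axis=1).astype('int'), axis=0, return_index=True)
added_no_duplicate = added_np[idxs].tolist()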

answered May 13, 2020 at 13:54


Ehsan · Accepted Answer · 2020-05-14 00:30:23Z

As for TypeError: can only assign an iterable:

added[a][0:2].sort() returns None, so you cannot assign it to a list slice. If you want the sorted list, use the built-in function sorted(), which returns a new sorted list:

added[a][0:2] = sorted(added[a][0:2])

As for <input>:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.:

This is a warning, not an error. Nonetheless, this will not work for you because, as the warning states, your object array does not have a well-defined == for it. So when you check whether added[a][2] in no_dups, it cannot really compare added[a][2] to the elements of no_dups, since equality is not suitably defined. If it is a NumPy array, you can use:

for a in range(len(added)):
    added[a][0:2] = sorted(added[a][0:2])
no_dups = []
for a in added:
    add_flag = True
    for b in no_dups:
        # compare the first two elements as lists and the arrays using .all()
        if (a[0:2]==b[0:2]) and ((a[2]==b[2]).all()):
            print('already appended')
            add_flag = False
            break
    if add_flag:
        no_dups.append(a)

len(no_dups):  78
len(added):   156

However, if all your arrays are of the same length, you should use NumPy stacking, which is significantly faster.
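The answer does not spell out the stacking approach, so the following is only one possible reading of it, sketched on hypothetical toy data (toy, combined and no_dups_stacked are illustrative names). It assumes every array has the same length and that exact element-wise equality is the right notion of "duplicate".

```python
import numpy as np

# Hypothetical toy data standing in for `added`: [col_i, col_j, array]
toy = [
    [0, 1, np.array([1.0, 2.0])],
    [1, 0, np.array([1.0, 2.0])],  # duplicate of the first row, pair flipped
    [0, 1, np.array([9.0, 9.0])],  # same pair but a different array: kept
    [2, 0, np.array([3.0, 4.0])],
]

# Sort each pair so that flipped references compare equal; shape (n, 2)
pairs = np.sort(np.array([row[:2] for row in toy]), axis=1)
# Stack the equal-length arrays into one 2-D array; shape (n, m)
arrays = np.stack([row[2] for row in toy])

# One combined row per entry: sorted pair followed by the array values
combined = np.hstack([pairs, arrays])

# Unique rows of the combined matrix, keeping first occurrences
_, idx = np.unique(combined, axis=0, return_index=True)
no_dups_stacked = [toy[i] for i in np.sort(idx)]
print(len(toy), len(no_dups_stacked))
```

This replaces the nested Python loop with a single vectorised call, at the cost of building one extra array of shape (n, m + 2).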

I receive the following error when using the for loop answer: — geds133, Commented May 13, 2020 at 14:09
@geds133 I understood the question a bit differently. The error AttributeError: 'bool' object has no attribute 'all' is thrown because you try to compare lists. I updated the answer in case you are interested in more detail; however, I prefer the accepted answer. — Ehsan, Commented May 14, 2020 at 0:28
