Delete columns based on repeat value in one row in numpy array

Question

I'm hoping to delete columns in my arrays that have repeated entries in row 1, as shown below (row 1 has repeats of the values 1 and 2.5, so one of each of those values has been deleted, together with the column each deleted value lies in).

initial_array =

row 0   [[  1,    1,    1,    1,    1,    1,    1,    1,]
row 1    [0.5,    1,  2.5,    4,  2.5,    2,    1,  3.5,]
row 2    [  1,  1.5,    3,  4.5,    3,  2.5,  1.5,    4,]
row 3    [228,  314,  173,  452,  168,  351,  300,  396]]

final_array =
row 0   [[  1,    1,    1,    1,    1,    1,]
row 1    [0.5,    1,  2.5,    4,    2,  3.5,]
row 2    [  1,  1.5,    3,  4.5,  2.5,    4,]
row 3    [228,  314,  173,  452,  351,  396]]

Ways I was thinking of included using some function that checks for repeats, giving a True response for the second (or later) time a value turns up in the dataset, then using that response to delete the column. That, or possibly using the return_index option of numpy.unique. I just can't quite find a way through it or find the right function, though.

If I could find a way to return a mean value in row 3 of the retained repeat and the deleted one, that would be even better (see below).

final_array_averaged =
row 0   [[  1,    1,      1,    1,    1,    1,]
row 1    [0.5,    1,    2.5,    4,    2,  3.5,]
row 2    [  1,  1.5,      3,  4.5,  2.5,    4,]
row 3    [228,  307,  170.5,  452,  351,  396]]

Thanks in advance for any help you can give to a beginner who is stumped!

Divakar · Accepted Answer · 2016-07-27 08:50:52Z

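You can use the optional arguments of np.unique and then use np.bincount, with the last row as weights, to get the final averaged output, like so:

import numpy as np

# unqID: first-occurrence column of each unique value; tag: group ID of every column; C: group sizes
_,unqID,tag,C = np.unique(arr[1],return_index=True,return_inverse=True,return_counts=True)
out = arr[:,unqID]                    # fancy indexing returns a copy, so arr is untouched
out[-1] = np.bincount(tag,arr[3])/C   # per-group sum of row 3 divided by group size (needs a float arr)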
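
To make the three extra return values less opaque, here is a small sketch that prints nothing but names each intermediate array for the sample data. The variable names (values, first_idx, inverse, counts, group_means) are illustrative and not part of the original answer.

```python
import numpy as np

# Row 1 (the key row) and row 3 (the values to average) of the sample array.
row = np.array([0.5, 1, 2.5, 4, 2.5, 2, 1, 3.5])
weights = np.array([228, 314, 173, 452, 168, 351, 300, 396], dtype=float)

values, first_idx, inverse, counts = np.unique(
    row, return_index=True, return_inverse=True, return_counts=True
)
# values    -> sorted unique keys
# first_idx -> column where each unique key first appears
# inverse   -> for every original column, the index of its key in `values`
# counts    -> how many columns share each key

# bincount sums the weights that fall into each group; dividing by the
# group sizes turns the sums into means.
group_means = np.bincount(inverse, weights=weights) / counts
```

For this input, first_idx is [0, 1, 5, 2, 7, 3] and group_means is [228, 307, 351, 170.5, 396, 452], which is the last row of the sorted output in the sample run.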

Sample run -

In [212]: arr
Out[212]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2.5,    2. ,    1. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    3. ,    2.5,    1.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  168. ,  351. ,  300. ,  396. ]])

In [213]: out
Out[213]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2. ,    2.5,    3.5,    4. ],
       [   1. ,    1.5,    2.5,    3. ,    4. ,    4.5],
       [ 228. ,  307. ,  351. ,  170.5,  396. ,  452. ]])

As can be seen, the columns of the output are now ordered so that the second row is sorted. If you want to keep the original order, index with the np.argsort of unqID, like so:

In [221]: out[:,unqID.argsort()]
Out[221]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  307. ,  170.5,  452. ,  351. ,  396. ]])
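
Putting the pieces together, the steps above could be wrapped in one function. This is only a sketch: the function name dedupe_columns_mean and its key_row / mean_row parameters are made up for illustration, and it assumes exact equality is the right test for "repeat" (floating-point keys that differ by rounding error would be treated as distinct).

```python
import numpy as np

def dedupe_columns_mean(arr, key_row=1, mean_row=-1):
    """Drop columns whose key_row entry repeats; average mean_row per group.

    Hypothetical helper, not from the original answer.
    """
    arr = np.asarray(arr, dtype=float)  # float, so means are not truncated
    _, first_idx, inverse, counts = np.unique(
        arr[key_row], return_index=True, return_inverse=True, return_counts=True
    )
    out = arr[:, first_idx]  # copy holding the first column of each group
    out[mean_row] = np.bincount(inverse, weights=arr[mean_row]) / counts
    # Columns are currently sorted by key; reorder by first occurrence.
    return out[:, np.argsort(first_idx)]

initial_array = np.array([
    [1,   1,   1,   1,   1,   1,   1,   1],
    [0.5, 1,   2.5, 4,   2.5, 2,   1,   3.5],
    [1,   1.5, 3,   4.5, 3,   2.5, 1.5, 4],
    [228, 314, 173, 452, 168, 351, 300, 396],
])
final_array_averaged = dedupe_columns_mean(initial_array)
```

With the sample data this reproduces final_array_averaged from the question, and initial_array itself is left unchanged.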

Kasravnd · Answer · 2016-07-27 08:41:22Z

You can find the indices of wanted columns using unique:

>>> indices = np.sort(np.unique(A[1], return_index=True)[1])

Then use simple indexing to get the desired columns:

>>> A[:,indices]
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  351. ,  396. ]])
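
The question also mentions a True/False check for repeats. As a sketch of that variant (the names keep and deduped are illustrative), the same first-occurrence indices can be turned into a boolean mask, which keeps the original column order without an explicit sort:

```python
import numpy as np

A = np.array([
    [1,   1,   1,   1,   1,   1,   1,   1],
    [0.5, 1,   2.5, 4,   2.5, 2,   1,   3.5],
    [1,   1.5, 3,   4.5, 3,   2.5, 1.5, 4],
    [228, 314, 173, 452, 168, 351, 300, 396],
], dtype=float)

# Column index of the first occurrence of each unique value in row 1.
_, first_idx = np.unique(A[1], return_index=True)

# Mask that is True only where a value appears for the first time.
keep = np.zeros(A.shape[1], dtype=bool)
keep[first_idx] = True

# Boolean indexing selects columns in their original left-to-right order.
deduped = A[:, keep]
```

Here ~keep marks exactly the repeated columns (indices 4 and 6 for the sample data), which may be useful if you want to inspect what is being dropped.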

5 Comments

Colonel Beauvel Over a year ago

It is exactly the answer I posted a minute ago! But why use np.sort?

Kasravnd Over a year ago

@ColonelBeauvel No it's not, your answer doesn't preserve the order. ;-) Although I didn't see your answer.

Colonel Beauvel Over a year ago

It is fully true, there is a problem with the order.

georussell Over a year ago

Thanks very much to both of you. Yours did the job, but I had to give the accepted answer to @Divakar for creating code that gives the mean result, too.

Kasravnd Over a year ago

@georussell Welcome. I think I missed that part, but Divakar's answer has done the job very well.

Eelco Hoogendoorn · Answer · 2016-07-28 16:06:27Z

This is a typical grouping problem, which can be solved elegantly and efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
unique, final_array = npi.group_by(initial_array[1]).mean(initial_array, axis=1)

Note that there are many other reductions than mean; if you want the original behavior you described, you could replace 'mean' with 'first', for instance.
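
If adding a dependency is not an option, a grouped mean over every row can be approximated in plain NumPy. The sketch below is not a description of how numpy_indexed works internally; it only illustrates the same group-then-reduce idea, and the names keys, inverse, counts and group_means are illustrative.

```python
import numpy as np

initial_array = np.array([
    [1,   1,   1,   1,   1,   1,   1,   1],
    [0.5, 1,   2.5, 4,   2.5, 2,   1,   3.5],
    [1,   1.5, 3,   4.5, 3,   2.5, 1.5, 4],
    [228, 314, 173, 452, 168, 351, 300, 396],
], dtype=float)

# Group columns by their row-1 value.
keys, inverse, counts = np.unique(
    initial_array[1], return_inverse=True, return_counts=True
)

# For each row, sum the entries that belong to the same group and divide
# by the group size; stacking the rows gives one column per unique key.
group_means = np.vstack(
    [np.bincount(inverse, weights=row) / counts for row in initial_array]
)
```

The resulting columns are ordered by sorted key rather than by first occurrence, so the order-restoring step from the accepted answer would still be needed to match final_array_averaged.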

edited Jul 28, 2016 at 16:06

answered Jul 27, 2016 at 9:19
