3

I'm hoping to delete columns in my arrays that have repeated entries in row 1, as shown below (row 1 has repeats of the values 1 and 2.5, so one of each of those values has been deleted, together with the column each deleted value lies within).

initial_array =

row 0   [[  1,    1,    1,    1,    1,    1,    1,    1,]
row 1    [0.5,    1,  2.5,    4,  2.5,    2,    1,  3.5,]
row 2    [  1,  1.5,    3,  4.5,    3,  2.5,  1.5,    4,]
row 3    [228,  314,  173,  452,  168,  351,  300,  396]]

final_array =
row 0   [[  1,    1,    1,    1,    1,    1,]
row 1    [0.5,    1,  2.5,    4,    2,  3.5,]
row 2    [  1,  1.5,    3,  4.5,  2.5,    4,]
row 3    [228,  314,  173,  452,  351,  396]]

Approaches I was considering included using some function that checks for repeats, giving a True response for the second (or later) time a value turns up in the dataset, then using that response to delete the column. That, or possibly using the return_index option of numpy.unique. I just can't quite find a way through it or find the right function, though.

If I could find a way to return the mean of the row 3 values of the retained repeat and the deleted one, that would be even better (see below).

final_array_averaged =
row 0   [[  1,    1,      1,    1,    1,    1,]
row 1    [0.5,    1,    2.5,    4,    2,  3.5,]
row 2    [  1,  1.5,      3,  4.5,  2.5,    4,]
row 3    [228,  307,  170.5,  452,  351,  396]]

Thanks in advance for any help you can give to a beginner who is stumped!

3 Answers

2

You can use the optional arguments of np.unique to get the first-occurrence indices, the inverse tags, and the counts for row 1, then use np.bincount with the last row as weights to get the final averaged output, like so -

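import numpy as np
# unqID: first column of each unique value; tag: group ID of every column; C: group sizes
_, unqID, tag, C = np.unique(arr[1], return_index=True, return_inverse=True, return_counts=True)
out = arr[:, unqID]                     # fancy indexing returns a copy, arr is untouched
out[-1] = np.bincount(tag, arr[3]) / C  # per-group sum of the last row divided by group size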

Sample run -

In [212]: arr
Out[212]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2.5,    2. ,    1. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    3. ,    2.5,    1.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  168. ,  351. ,  300. ,  396. ]])

In [213]: out
Out[213]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2. ,    2.5,    3.5,    4. ],
       [   1. ,    1.5,    2.5,    3. ,    4. ,    4.5],
       [ 228. ,  307. ,  351. ,  170.5,  396. ,  452. ]])

As can be seen, the columns of the output are now ordered so that the second row is sorted. If you want to keep the original column order, index with the argsort of unqID, like so -

In [221]: out[:,unqID.argsort()]
Out[221]: 
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  307. ,  170.5,  452. ,  351. ,  396. ]])
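
For reference, the steps above can be collected into one self-contained script. This is only a sketch: the helper name dedupe_columns_mean and its key_row / mean_row parameters are made up for illustration and are not part of NumPy. It assumes the key row holds values that compare exactly equal (np.unique does not apply a tolerance) and casts the result to float so the means are not truncated for integer input.

```python
import numpy as np

arr = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0.5, 1, 2.5, 4, 2.5, 2, 1, 3.5],
    [1, 1.5, 3, 4.5, 3, 2.5, 1.5, 4],
    [228, 314, 173, 452, 168, 351, 300, 396],
], dtype=float)


def dedupe_columns_mean(a, key_row=1, mean_row=-1):
    """Hypothetical helper: keep the first column for each unique value in
    key_row, replace mean_row with the per-group mean, keep original order."""
    _, first_idx, inverse, counts = np.unique(
        a[key_row], return_index=True, return_inverse=True, return_counts=True
    )
    # Columns of the first occurrences, in sorted-key order (astype copies).
    out = a[:, first_idx].astype(float)
    # Sum mean_row within each group, then divide by the group size.
    out[mean_row] = np.bincount(inverse, weights=a[mean_row]) / counts
    # Reorder the columns by where each key first appeared in the input.
    return out[:, np.argsort(first_idx)]


result = dedupe_columns_mean(arr)
print(result)
```

Only the last row is averaged here; the other rows are taken from the first occurrence of each repeated value, which matches the final_array_averaged layout in the question.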

1

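You can find the indices of the wanted columns (the first occurrence of each value in row 1) using np.unique with return_index=True, then sort them to preserve the original order: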

>>> indices = np.sort(np.unique(A[1], return_index=True)[1])

Then use simple indexing to get the desired columns:

>>> A[:,indices]
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  351. ,  396. ]])
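
As a runnable version of the same idea, the sketch below rebuilds the question's array (the name A is taken from the snippet above) and also shows the boolean-mask variant the question hints at, where True marks a column to keep. The variable names first_idx, keep, and mask are illustrative choices, not anything defined by NumPy.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0.5, 1, 2.5, 4, 2.5, 2, 1, 3.5],
    [1, 1.5, 3, 4.5, 3, 2.5, 1.5, 4],
    [228, 314, 173, 452, 168, 351, 300, 396],
], dtype=float)

# Index of the first occurrence of each unique value in row 1.
_, first_idx = np.unique(A[1], return_index=True)

# Sorting the indices restores the original left-to-right column order.
keep = np.sort(first_idx)
deduped = A[:, keep]

# Equivalent boolean mask: True for first occurrences, False for repeats.
mask = np.zeros(A.shape[1], dtype=bool)
mask[first_idx] = True
deduped_via_mask = A[:, mask]

print(keep)
print(deduped)
```

Both selections give the same array, because a boolean mask always picks columns in their original order.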

5 Comments

It is exactly the answer I posted a minute ago! But why use np.sort?
@ColonelBeauvel No it's not, your answer doesn't preserve the order. ;-) Although I didn't see your answer.
That is entirely true, there is a problem with the order.
Thanks very much to both of you. Yours did the job, but I had to give the accepted answer to @Divakar for creating code that gives the mean result, too.
@georussell Welcome. I think I missed that part, but Divakar's answer has done the job very well.
1

This is a typical grouping problem, which can be solved elegantly and efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
unique, final_array = npi.group_by(initial_array[1]).mean(initial_array, axis=1)

Note that there are many reductions other than mean; if you want the original behavior you described, you could replace 'mean' with 'first', for instance.
