
I have two arrays: one is labels, the other is distances:

labels = array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 2, 2, 3, 3, 3, 2, 3,
        0, 3, 3, 2, 3, 2, 3, 2,...])

distances = array([2.32284095, 0.36254613, 0.95734965, 0.35429638, 2.79098656,
        5.45921793, 2.63795657, 1.34516461, 1.34028463, 1.10808795,
        1.60549826, 1.42531201, 1.16280383, 1.22517273, 4.48511033,
        0.71543217, 0.98840598,...]) 

What I want to do is group the values from distances into N arrays based on the number of unique label values (in this case N = 4), so all values with label 3 go into one array, all values with label 2 into another, and so on.

I can think of a simple brute-force approach with loops and if-conditions, but this will incur a serious slowdown on large arrays. I feel there are better ways of doing this with a native list comprehension, NumPy, or something else; I'm just not sure what. What would be the best, most efficient approaches?

"Brute force" example for reference, note:(len(labels)==len(distances)):

import numpy as np

all_distance_arrays = []
for label in np.unique(labels):          # one full pass per unique label
    sorted_distances = []
    for index in range(len(labels)):
        if label == labels[index]:
            sorted_distances.append(distances[index])
    all_distance_arrays.append(sorted_distances)

Comments:

  • Please show the brute-force approach as a reference implementation. Right now you're describing an expected result, not showing it. Commented Mar 26, 2022 at 17:16
  • Also, you really don't need to split the arrays if you do it right. Can you describe or show the intended use case to avoid an XY problem? Commented Mar 26, 2022 at 17:17
  • Added an example; the use case is a classification problem. I am trying to speed up the internal operations of the classification algorithm and make the code neater. Commented Mar 26, 2022 at 18:36

4 Answers


A simple list comprehension will be nice and fast:

groups = [distances[labels == i] for i in np.unique(labels)]

Output:

>>> groups
[array([0.95734965]),
 array([0.36254613, 0.35429638, 1.34028463, 1.10808795, 1.42531201,
        1.22517273]),
 array([5.45921793, 1.34516461, 1.16280383, 0.71543217, 0.98840598]),
 array([2.32284095, 2.79098656, 2.63795657, 1.60549826, 4.48511033])]
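If you also need to know which label each group belongs to, a small variation (my addition, not part of the original answer) keeps them paired in a dict; groups_by_label is a name I've made up here:

import numpy as np

# one boolean-mask pass over `distances` per unique label,
# keyed by the label instead of positional order
groups_by_label = {lab: distances[labels == lab] for lab in np.unique(labels)}

Note that this still scans the full labels array once per unique label, so it is O(n * k) for k unique labels; that is usually fine when k is small.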


Using just NumPy:

_, counts = np.unique(labels, return_counts=True)  # counts[i] = number of occurrences of each unique label
sor = labels.argsort()
sections = np.cumsum(counts)                       # end index of each slice
labels_sor = np.split(labels[sor], sections)[:-1]      # [:-1] drops the empty trailing split
distances_sor = np.split(distances[sor], sections)[:-1]
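For reference, here is a self-contained sketch of the same idea (the variable names and the shortened sample arrays are mine, taken from the question's data). Using cumsum(counts)[:-1] as the split points avoids producing the empty trailing array in the first place:

import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2, 3])
distances = np.array([2.32, 0.36, 0.96, 0.35, 2.79, 5.46, 2.64])

uniq, counts = np.unique(labels, return_counts=True)
order = labels.argsort()                    # sort both arrays in lockstep
sections = np.cumsum(counts)[:-1]           # boundaries *between* groups
groups = np.split(distances[order], sections)
# groups[i] contains the distances whose label == uniq[i]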



"Brute force" seems likely to be adequate with a reasonable number of labels:

from collections import defaultdict

dist_group = defaultdict(list)
for lb, ds in zip(labels, distances):
    dist_group[lb].append(ds)

It's hard to see why this wouldn't fit your purposes.
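If downstream code expects NumPy arrays rather than Python lists, a final conversion pass (my addition to this answer) is cheap:

import numpy as np

# dict mapping each label to a NumPy array of its distances
dist_arrays = {label: np.asarray(ds) for label, ds in dist_group.items()}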

Comments:

  • Agreed, but as scale increases I encounter significant slowdown from loops and nested conditionals, so I find that list-comprehension approaches often outperform this; I am looking for the fastest solutions.

You can do this with numpy functions only. First sort the arrays in lockstep (which is what np.unique does under the hood anyway), then split them where the label changes:

i = np.argsort(labels)
labels = labels[i]
distances = distances[i]
split_points = np.flatnonzero(np.diff(labels)) + 1   # indices where the label changes
result = np.split(distances, split_points)
unique_labels = labels[np.r_[0, split_points]]       # first element of each run
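The payoff of this approach is that it performs one O(n log n) argsort plus O(n) splitting, rather than one full boolean-mask pass per unique label, which helps when the number of distinct labels is large. Assuming the snippet above has run, a simple way to pair each group with its label (my addition, with `grouped` as an illustrative name):

# dict mapping each unique label to its group of distances
grouped = dict(zip(unique_labels, result))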

