
I have two arrays: one is labels, the other is distances:

labels = array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 2, 2, 3, 3, 3, 2, 3,
        0, 3, 3, 2, 3, 2, 3, 2,...])

distances = array([2.32284095, 0.36254613, 0.95734965, 0.35429638, 2.79098656,
        5.45921793, 2.63795657, 1.34516461, 1.34028463, 1.10808795,
        1.60549826, 1.42531201, 1.16280383, 1.22517273, 4.48511033,
        0.71543217, 0.98840598,...]) 

What I want to do is group the values from distances into N arrays based on the number of unique label values (in this case N = 4), so all values with label 3 go into one array, all values with label 2 into another, and so on.

I can think of a simple brute-force approach with loops and if-conditions, but this will incur a serious slowdown on large arrays. I feel there are better ways of doing this with a native list comprehension, NumPy, or something else; I'm just not sure what. What would be the best, most efficient approaches?

"Brute force" example for reference, note:(len(labels)==len(distances)):

import numpy as np

all_distance_arrays = []
for label in np.unique(labels):          # one full pass per unique label
    sorted_distances = []
    for index in range(len(labels)):
        if label == labels[index]:
            sorted_distances.append(distances[index])
    all_distance_arrays.append(sorted_distances)

Comments:

  • Please show the brute-force approach as a reference implementation. Right now you're describing an expected result, not showing it. Commented Mar 26, 2022 at 17:16
  • Also, you really don't need to split the arrays if you do it right. Can you describe or show the intended use case to avoid an XY problem? Commented Mar 26, 2022 at 17:17
  • Added an example; the use case is a classification problem. I am trying to speed up the internal operations of the classification algorithm and make the code neater. Commented Mar 26, 2022 at 18:36

4 Answers


A simple list comprehension will be nice and fast:

groups = [distances[labels == i] for i in np.unique(labels)]

Output:

>>> groups
[array([0.95734965]),
 array([0.36254613, 0.35429638, 1.34028463, 1.10808795, 1.42531201,
        1.22517273]),
 array([5.45921793, 1.34516461, 1.16280383, 0.71543217, 0.98840598]),
 array([2.32284095, 2.79098656, 2.63795657, 1.60549826, 4.48511033])]
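If you also need to know which label each group belongs to, a small variation (my addition, not part of the original answer) keeps them paired in a dict; groups_by_label is a name I've made up here:

import numpy as np

# one boolean-mask pass over `distances` per unique label,
# keyed by the label instead of positional order
groups_by_label = {lab: distances[labels == lab] for lab in np.unique(labels)}

Note that this still scans the full labels array once per unique label, so it is O(n * k) for k unique labels; that is usually fine when k is small.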


Using just NumPy:

_, counts = np.unique(labels, return_counts=True)  # counts[i] = number of occurrences of each unique label
sor = labels.argsort()
sections = np.cumsum(counts)                       # end index of each slice
labels_sor = np.split(labels[sor], sections)[:-1]      # [:-1] drops the empty trailing split
distances_sor = np.split(distances[sor], sections)[:-1]
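For reference, here is a self-contained sketch of the same idea (the variable names and the shortened sample arrays are mine, taken from the question's data). Using cumsum(counts)[:-1] as the split points avoids producing the empty trailing array in the first place:

import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2, 3])
distances = np.array([2.32, 0.36, 0.96, 0.35, 2.79, 5.46, 2.64])

uniq, counts = np.unique(labels, return_counts=True)
order = labels.argsort()                    # sort both arrays in lockstep
sections = np.cumsum(counts)[:-1]           # boundaries *between* groups
groups = np.split(distances[order], sections)
# groups[i] contains the distances whose label == uniq[i]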



"Brute force" seems likely to be adequate with a reasonable number of labels:

from collections import defaultdict

dist_group = defaultdict(list)
for lb, ds in zip(labels, distances):
    dist_group[lb].append(ds)

It's hard to see why this wouldn't fit your purposes.
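If downstream code expects NumPy arrays rather than Python lists, a final conversion pass (my addition to this answer) is cheap:

import numpy as np

# dict mapping each label to a NumPy array of its distances
dist_arrays = {label: np.asarray(ds) for label, ds in dist_group.items()}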

Comments:

  • Agreed, but as scale increases I encounter significant slowdown from loops and nested conditionals, so I find that list-comprehension approaches often outperform this; I am looking for the fastest solutions.

You can do this with numpy functions only. First sort the arrays in lockstep (which is what np.unique does under the hood anyway), then split them where the label changes:

i = np.argsort(labels)
labels = labels[i]
distances = distances[i]
split_points = np.flatnonzero(np.diff(labels)) + 1   # indices where the label changes
result = np.split(distances, split_points)
unique_labels = labels[np.r_[0, split_points]]       # first element of each run
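The payoff of this approach is that it performs one O(n log n) argsort plus O(n) splitting, rather than one full boolean-mask pass per unique label, which helps when the number of distinct labels is large. Assuming the snippet above has run, a simple way to pair each group with its label (my addition, with `grouped` as an illustrative name):

# dict mapping each unique label to its group of distances
grouped = dict(zip(unique_labels, result))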

