Python: Counting identical rows in an array (without any imports)

Question

For example, given:

import numpy as np
data = np.array(
    [[0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0]])

I want to get a 3-dimensional array, looking like:

result = array([[[ 2.,  0.],
                 [ 0.,  2.]],

                [[ 0.,  2.],
                 [ 0.,  0.]]])

One way is:

for row in data:
    newArray[row[0], row[1], row[2]] += 1

What I'm trying to do is the following:

for i in dimension1:
    for j in dimension2:
        for k in dimension3:
            result[i,j,k] = (data[data[data[:,0]==i, 1]==j, 2]==k).sum()

This doesn't seem to work, and I would like to get the desired result by sticking with this implementation rather than the one mentioned at the beginning (and without any extra imports, e.g. Counter).

Thanks.

I think your first approach is easier to read and certainly faster.
– tobias_k, Commented Feb 6, 2014 at 18:45
@tobias_k I know! I'm just curious to see why the second approach isn't working :)
– mihalios, Commented Feb 6, 2014 at 18:46

Ashwini Chaudhary · Accepted Answer · 2014-02-06 19:16:37Z


You can also use numpy.histogramdd for this:

>>> np.histogramdd(data, bins=(2, 2, 2))[0]
array([[[ 2.,  0.],
        [ 0.,  2.]],

       [[ 0.,  2.],
        [ 0.,  0.]]])


tobias_k · Accepted Answer · 2014-02-06 19:17:53Z

The problem is that data[data[data[:,0]==i, 1]==j, 2]==k is not what you expect it to be.

Let's take this apart for the case (i,j,k) == (0,0,0)

data[:,0]==0 is [True, True, False, False, True, True], and data[data[:,0]==0] correctly gives us the lines where the first number is 0.

Now, from those rows, we test where the second number is 0: data[data[:,0]==0, 1]==0, which gives us [True, False, False, True]. And this is the problem: the mask has only four entries, and its positions refer to the four filtered rows, not to the rows of data. So if we apply it to data, i.e., data[data[data[:,0]==0, 1]==0], we do not get the rows where the first and second numbers are 0, but the 0th and 3rd rows of data instead:

In [51]: data[data[data[:,0]==0, 1]==0]
Out[51]: array([[0, 0, 0],
                [1, 0, 1]])

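And if we now filter for the rows where the third number is 0, we get the wrong result w.r.t. the original data. (Recent NumPy versions refuse such a length-mismatched boolean mask and raise an IndexError instead of returning the rows shown above.)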

And that's why your approach does not work. For better methods, see the other answers.
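For completeness, here is a minimal sketch of how the loop from the question could be repaired while keeping its structure. It assumes the three column tests are combined into a single boolean mask, so the mask always has one entry per row of data. The shape (2, 2, 2) is hard-coded from the example rather than derived from the data.

```python
import numpy as np

data = np.array([[0, 0, 0],
                 [0, 1, 1],
                 [1, 0, 1],
                 [1, 0, 1],
                 [0, 1, 1],
                 [0, 0, 0]])

result = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        for k in range(2):
            # Each comparison yields a mask of length len(data); combining
            # them with & keeps the mask aligned with the original rows.
            mask = (data[:, 0] == i) & (data[:, 1] == j) & (data[:, 2] == k)
            result[i, j, k] = mask.sum()

print(result)
```

Because every mask is computed against the full array, no step indexes data with a mask built from an already filtered subset, which is what went wrong in the nested version.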

Community · Accepted Answer · 2017-05-23 12:28:40Z

You can do something like the following

# Get the output dimensions and construct the output array.
>>> dshape = tuple(data.max(axis=0)+1)
>>> dshape
(2, 2, 2)
>>> out = np.zeros(dshape)

If you have numpy 1.8+:

>>> np.add.at(out.ravel(), np.ravel_multi_index(data.T, dshape), 1)

Else:

# Get the flat indices, then reduce them to their unique values
>>> inds = np.ravel_multi_index(data.T, dshape)
>>> inds, inverse = np.unique(inds, return_inverse=True)
>>> values = np.bincount(inverse)

>>> values
array([2, 2, 2])

>>> out.flat[inds] = values
>>> out
array([[[ 2.,  0.],
        [ 0.,  2.]],

       [[ 0.,  2.],
        [ 0.,  0.]]])

NumPy versions before 1.8 do not have np.add.at, so the first snippet will not work there. As ravel_multi_index may not be the fastest approach, you can also look into taking the unique rows of a NumPy array; in effect, the two operations should be equivalent.
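To illustrate why np.add.at matters here, the sketch below compares a plain fancy-indexed += (which increments each repeated index only once) with np.add.at, and also shows the unique-rows alternative. It assumes a NumPy version recent enough for np.unique to accept axis and return_counts; the variable names are illustrative only.

```python
import numpy as np

data = np.array([[0, 0, 0],
                 [0, 1, 1],
                 [1, 0, 1],
                 [1, 0, 1],
                 [0, 1, 1],
                 [0, 0, 0]])
shape = (2, 2, 2)
idx = tuple(data.T)  # one index array per dimension

# Buffered fancy-index addition: each repeated index is incremented only once.
buffered = np.zeros(shape)
buffered[idx] += 1

# Unbuffered addition: every occurrence of an index is accumulated.
counted = np.zeros(shape)
np.add.at(counted, idx, 1)

# Alternative: count the distinct rows directly, then scatter the counts.
rows, counts = np.unique(data, axis=0, return_counts=True)
from_unique = np.zeros(shape)
from_unique[tuple(rows.T)] = counts

print(buffered.max(), counted.max())
```

Here buffered only ever reaches 1 per cell, while counted and from_unique both hold the true row counts.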

eric chiang · Accepted Answer · 2014-02-06 19:30:07Z

Don't fear the imports. They're what make Python awesome.

This assumes that you already have the result array.

import numpy as np
data = np.array(
    [[0, 0, 0],
     [0, 1, 1],
     [1, 0, 1],
     [1, 0, 1],
     [0, 1, 1],
     [0, 0, 0]]
)
result = np.zeros((2,2,2))

# inclusive index range (lowest, highest) allowed along each dim
dim_ranges = [(0, n - 1) for n in result.shape]
dim_ranges
# Out[]:
#     [(0, 1), (0, 1), (0, 1)]

# Multidimensional histogram will effectively "count" along each dim
sums, _ = np.histogramdd(data, bins=result.shape, range=dim_ranges)
result += sums
result
# Out[]:
#     array([[[ 2.,  0.],
#             [ 0.,  2.]],
#
#            [[ 0.,  2.],
#             [ 0.,  0.]]])

This solution works for any "result" ndarray, no matter what its shape. Additionally, it works fine even if your "data" ndarray contains indices that are out of bounds for the result array: histogramdd does not count values that fall outside the given range.
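The out-of-bounds behaviour can be checked with a small sketch. The two extra rows below are made up for illustration, and the ranges are assumed to be given as a list of (0, n - 1) pairs, one per dimension of the result.

```python
import numpy as np

data = np.array([[0, 0, 0],
                 [0, 1, 1],
                 [1, 0, 1],
                 [2, 0, 0],    # first index too large for a (2, 2, 2) result
                 [0, -1, 1]])  # negative second index

result = np.zeros((2, 2, 2))
dim_ranges = [(0, n - 1) for n in result.shape]

# Rows with any coordinate outside its range are left out of the histogram.
sums, _ = np.histogramdd(data, bins=result.shape, range=dim_ranges)
result += sums

print(result.sum())
```

Only the three in-range rows contribute, so the counts in result sum to 3 even though data has five rows.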
