numpy: Compressing block matrix

Question

Consider a matrix M1 giving values for all combinations x,y. Consider a partition f(x)->X and a partition g(y)->Y. Furthermore, consider an operation p(A) on a set A of numbers, e.g. max(A) or sum(A).

The mappings f,g can be used to create from M1 a block matrix M2 where all x that are mapped to the same X are adjacent, and the same for all y.

This matrix M2 has a block for each combination of the 'sets' X,Y.

Now I would like to condense this matrix M2 into another matrix M3 by applying p on each block separately. M3 has one value for each combination of X,Y.

Ideally, I would like to skip the explicit transformation of M1 into M2 and instead apply f and g on the fly.

What would be the most efficient way to perform such an operation, and is it possible to use numpy or scipy for it?

Special case: Actually, in my case x and y are identical and there is only one function f applied to both of them. I only care about the part of M2 that is under the diagonal.

Do f and g only work with scalar inputs? To use numpy effectively, you want to write them so that they accept an array (possibly 1d) of values and return an array of matching size. Otherwise you are stuck with iterating, in one way or another, over the elements of M1. What do you hope to gain by skipping M2?
– hpaulj, Commented Jan 5, 2017 at 22:08

Michael · Accepted Answer · 2020-04-02 15:12:08Z

4

The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices X and Y. You can then group by the partition group indices and finally restructure the matrix back into its original form.

For example, if your matrix is

>>> import numpy as np
>>> import pandas as pd
>>> M1 = np.arange(25).reshape((5,5))
>>> M1
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

and your partitions are

>>> def f(x):
...     return np.array([1,1,1,2,2])[x]
>>> def g(y):
...     return np.array([3,4,4,4,5])[y]

From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a DataFrame and using its stack() method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.

>>> st = pd.DataFrame(M1).stack().to_frame('M1')
>>> st
     M1
0 0   0
  1   1
  2   2
  3   3
  4   4
1 0   5
...
4 3  23
  4  24

(I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:

>>> st['X'] = f(st.index.get_level_values(0))
>>> st['Y'] = g(st.index.get_level_values(1))

Then you can group by those indices and apply your aggregation function of choice.

>>> stp = st.groupby(['X', 'Y']).agg(p)

You will have to define p (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like sum(), you can just use st.groupby(...).sum() because Pandas has built-in support for that and a few other standard functions, but agg is general and works for any reduction function p you can provide.
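As a small illustration of a custom p, the sketch below reduces each block to its value range (max minus min). The name value_range is made up for this example, and the partitions are written as plain lookup arrays (row_groups and col_groups, also hypothetical names) that match f and g above. It assumes pandas hands each group's values to the function as a Series, which supports the same reductions as a one-dimensional array.

```python
import numpy as np
import pandas as pd

M1 = np.arange(25).reshape(5, 5)
row_groups = np.array([1, 1, 1, 2, 2])   # same partition as f above
col_groups = np.array([3, 4, 4, 4, 5])   # same partition as g above

def value_range(block):
    """Hypothetical p: spread between the largest and smallest value of a block."""
    return block.max() - block.min()

# One row per matrix element, indexed by (original row, original column).
st = pd.DataFrame(M1).stack().to_frame("M1")
st["X"] = row_groups[st.index.get_level_values(0).to_numpy()]
st["Y"] = col_groups[st.index.get_level_values(1).to_numpy()]

# One reduced value per (X, Y) block.
ranges = st.groupby(["X", "Y"])["M1"].agg(value_range)
print(ranges)
```

The result is a Series indexed by (X, Y), so ranges.loc[(1, 4)] is the range of the block with X = 1 and Y = 4.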

Finally, the unstack() method will convert the DataFrame back into the proper 2D "matrix form", and then if you want you can use the to_numpy() method to turn it back into a pure NumPy array (the older as_matrix() method was removed in pandas 1.0).

>>> M3 = stp.unstack().to_numpy()
>>> M3
array([[ 15,  63,  27],
       [ 35, 117,  43]])

If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example. However I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that having an additional 2 or 3 copies of them would exhaust the available memory.
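For reductions that NumPy exposes as binary ufuncs (sum via np.add, max via np.maximum), a two-dimensional grouping can be sketched in plain NumPy by calling the ufunc's at method with broadcast index arrays, which avoids building the stacked table first. This is only a sketch under assumptions: f_labels and g_labels are hypothetical names holding f and g evaluated at every row and column index, and the approach does not cover an arbitrary p.

```python
import numpy as np

M1 = np.arange(25).reshape(5, 5)
f_labels = np.array([1, 1, 1, 2, 2])   # f(x) for every row index x
g_labels = np.array([3, 4, 4, 4, 5])   # g(y) for every column index y

# Map arbitrary labels to consecutive integers 0..k-1.
xs, row_idx = np.unique(f_labels, return_inverse=True)
ys, col_idx = np.unique(g_labels, return_inverse=True)

# Broadcasting pairs every row group index with every column group index.
target = (row_idx[:, None], col_idx[None, :])

# p = sum: accumulate every M1[i, j] into M3_sum[row_idx[i], col_idx[j]].
M3_sum = np.zeros((len(xs), len(ys)), dtype=M1.dtype)
np.add.at(M3_sum, target, M1)

# p = max: start from the smallest representable integer, then take maxima.
M3_max = np.full((len(xs), len(ys)), np.iinfo(M1.dtype).min, dtype=M1.dtype)
np.maximum.at(M3_max, target, M1)

print(M3_sum)
print(M3_max)
```

The at method applies its updates unbuffered, so elements that land in the same block accumulate correctly. It may run slower than an ordinary reduction, so it is worth timing on realistic sizes before relying on it.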

edited Apr 2, 2020 at 15:12

Michael


answered Jan 5, 2017 at 16:51

David Z


6 Comments

hpaulj Over a year ago

I would describe vectorize as an easy way, not a quick way (which implies a speed improvement).

hpaulj Over a year ago

Your f(x) can be written as np.array([1,1,1,2,2])[x], and will run a lot faster than the vectorized version.

hpaulj Over a year ago

This is the first time I've seen @vectorize used as a decorator. It does work, though it doesn't allow parameters like otypes. Often when posters have problems with vectorize (other than speed expectations) it's because they need one of the extra parameters.

David Z Over a year ago

@hpaulj It's quick in the sense of the amount of time you need to spend writing the code. I'm well aware that vectorize is not quick to execute, but speed is not a concern here. Anyway, thanks for the suggestion.

Radio Controlled Over a year ago

It should be quick in running, not coding. I think the main problem here is that scipy does not support sparse matrices with more than two dimensions. Otherwise you could always have one more dimension for each group (4d) and use a function like max() to reduce the dimensions for the members of the groups.

Roland W · Accepted Answer · 2017-01-05 17:52:52Z

1

Let M1 be a numpy n x m array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:

xs = sorted(set(f(i) for i in range(n)))
ys = sorted(set(g(i) for i in range(m)))

To build a block matrix for each X,Y you can use numpy boolean indexing along with the grid-construction helper ix_ to select only rows and columns belonging to X and Y, respectively. Finally, apply p to the selected submatrix:

from numpy import zeros, arange, ix_

ii, jj = arange(n), arange(m)
M3 = zeros((len(xs), len(ys)))

for k, X in enumerate(xs):
    for l, Y in enumerate(ys):
        M3[k,l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])

The partitions f and g have to apply element-wise to numpy arrays for this to work. As mentioned in the other answer the numpy.vectorize decorator can be used to achieve this.

To give an example:

import numpy as np

n = m = 5
M1 = np.arange(25).reshape(5,5)
f = lambda x: x // 3      # f(ii) = [0, 0, 0, 1, 1]
g = lambda x: (x+2) // 3  # g(jj) = [0, 1, 1, 1, 2]
p = np.sum

# Running the loop above with these definitions gives
# M3 = [[  15.,   63.,   27.],
#       [  35.,  117.,   43.]]
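For the special case in the question (x and y identical, a single partition f, only the part under the diagonal of interest), the same loop can be restricted to blocks on or below the block diagonal. The sketch below is one possible reading of that special case, not a confirmed requirement: labels and masks are names introduced here, and each boolean mask is computed once instead of inside the inner loop.

```python
import numpy as np

n = 5
M1 = np.arange(n * n).reshape(n, n)
labels = np.arange(n) // 3              # f applied to every index: [0, 0, 0, 1, 1]
p = np.sum

groups = np.unique(labels)
masks = [labels == X for X in groups]   # one boolean mask per group, computed once

M3 = np.zeros((len(groups), len(groups)))
for k, row_mask in enumerate(masks):
    for l in range(k + 1):              # only blocks on or below the block diagonal
        M3[k, l] = p(M1[np.ix_(row_mask, masks[l])])

print(M3)
```

Blocks above the block diagonal stay at zero. Diagonal blocks are reduced in full here; whether their above-diagonal entries should be excluded depends on what M1 holds.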

edited Jan 5, 2017 at 17:52

answered Jan 5, 2017 at 17:36

Roland W


2 Comments

Radio Controlled Over a year ago

The problem is that we still have two python loops. I was hoping to do that in C by using numpy...

Roland W Over a year ago

I don't think that it can be done without loops. An alternative would be to have a 4D array indexed by (X, Y, x, y) and use apply_over_axes(p, a, (2,3)), but that just shifts the loops to the array construction (and is less powerful and constructs M2 explicitly). The loops are over the partitions only, anyways. As long as each contains more than just a handful of elements, the bulk of the computation should go into the evaluation of p.
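When every block has the same shape and M1 is already ordered so that the blocks are contiguous (so M1 is effectively M2), the 4D idea from the comment above needs no Python loops: a reshape exposes the block axes and a reduction over two of them yields M3. The sketch below assumes exactly that restricted situation. The block sizes bx and by are made-up values, and the axis order used is (X, x, Y, y) rather than (X, Y, x, y), because that is what a plain reshape produces.

```python
import numpy as np

M1 = np.arange(24).reshape(4, 6)
bx, by = 2, 3                       # assumed block height and width

# Axes: (block row X, row within block, block column Y, column within block)
blocks = M1.reshape(M1.shape[0] // bx, bx, M1.shape[1] // by, by)

# Reduce over the two "within block" axes.
M3_sum = blocks.sum(axis=(1, 3))
M3_max = blocks.max(axis=(1, 3))

print(M3_sum)
print(M3_max)
```

For partitions with unequal group sizes or interleaved members this reshape does not apply, and a loop over the partitions remains necessary.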

Santiago Armstrong · Accepted Answer · 2020-05-18 00:01:38Z

I encountered the same problem some years later, and in my opinion the best way to do this is as follows:

M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][: , g(y) == j])

This assumes that x and y hold all row and column indices of M1, that f takes values in [0,1,..,n-1], and that g takes values in [0,1,..,m-1].

An example would be

import numpy as np

M1 = np.random.random((4,6))

print(M1)

x = range(4)
y = range(6)
p = np.sum


def f(x):
    return np.array([0,0,1,2])[x]

def g(y):
    return np.array([0,1,1,0,1,0])[y]

n = 3 # number of groups in partition f
m = 2 # number of groups in partition g


M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][: , g(y) == j])


print(M2)

To automate n and m, you can use len(set(f(x))) and len(set(g(y))).
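If the labels returned by f and g are not already 0..n-1 and 0..m-1, one option is np.unique with return_inverse=True, which gives both the number of groups and a zero-based index for every element. The sketch below assumes the partitions are available as arrays of labels; f_values and g_values are hypothetical names with made-up labels.

```python
import numpy as np

M1 = np.arange(24, dtype=float).reshape(4, 6)
f_values = np.array([10, 10, 20, 30])        # hypothetical labels, not 0..n-1
g_values = np.array([7, 9, 9, 7, 9, 7])
p = np.sum

# row_idx[i] is the position of f_values[i] among the sorted unique labels.
row_labels, row_idx = np.unique(f_values, return_inverse=True)
col_labels, col_idx = np.unique(g_values, return_inverse=True)
n, m = len(row_labels), len(col_labels)

M2 = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        M2[i, j] = p(M1[row_idx == i, :][:, col_idx == j])

print(M2)
```

Row i of the result corresponds to the label row_labels[i], and column j to col_labels[j].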
