
Consider a matrix M1 giving values for all combinations x, y. Consider a partition f(x) -> X and a partition g(y) -> Y. Furthermore, consider an operation p(A) on a set A of numbers, e.g. max(A) or sum(A).

The mappings f,g can be used to create from M1 a block matrix M2 where all x that are mapped to the same X are adjacent, and the same for all y.

This matrix M2 has a block for each combination of the 'sets' X,Y.

Now I would like to condense this matrix M2 into another matrix M3 by applying p on each block separately. M3 has one value for each combination of X,Y.

Ideally, I would like to skip the explicit construction of M2 and instead apply f and g on the fly.

What would be the most efficient way to perform such an operation, and would it be possible to use numpy or scipy for it?

Special case: Actually, in my case x and y are identical and a single function f is applied to both of them. I only care about the part of M2 that lies below the diagonal.
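To illustrate the special case, here is a toy sketch (the labels and p = sum are made up for illustration):

```python
import numpy as np

# Toy illustration of the special case: x == y, a single partition f,
# and only the blocks below (and here, on) the diagonal of the result
# are filled in.
M1 = np.arange(25).reshape(5, 5)
labels = np.array([0, 0, 0, 1, 1])   # f applied to the indices 0..4

K = labels.max() + 1
M3 = np.full((K, K), np.nan)         # upper triangle stays NaN
for X in range(K):
    for Y in range(X + 1):           # lower triangle, incl. diagonal blocks
        M3[X, Y] = M1[np.ix_(labels == X, labels == Y)].sum()
```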

1 Comment
  • Do f and g only work with scalar inputs? Ideally, to use numpy, you want to write these in a way that works with an array (could be 1D) of values, returning an array of matching size. Otherwise you are stuck with iterating, in one way or another, over the elements of M1. What do you hope to gain by skipping M2? Commented Jan 5, 2017 at 22:08

3 Answers


The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices X and Y. You can then group by the partition group indices and finally restructure the matrix back into its original form.

For example, if your matrix is

>>> M1 = np.arange(25).reshape((5,5))
>>> M1
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

and your partitions are

>>> def f(x):
...     return np.array([1,1,1,2,2])[x]
>>> def g(y):
...     return np.array([3,4,4,4,5])[y]

From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a DataFrame and using its stack() method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.

>>> st = pd.DataFrame(M1).stack().to_frame('M1')
>>> st
     M1
0 0   0
  1   1
  2   2
  3   3
  4   4
1 0   5
...
4 3  23
  4  24

(I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:

>>> st['X'] = f(st.index.get_level_values(0))
>>> st['Y'] = g(st.index.get_level_values(1))

Then you can group by those indices and apply your aggregation function of choice.

>>> stp = st.groupby(['X', 'Y']).agg(p)

You will have to define p (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like sum(), you can just use st.groupby(...).sum() because Pandas has built-in support for that and a few other standard functions, but agg is general and works for any reduction function p you can provide.
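For instance, a custom reduction p (a hypothetical one, the spread of each block, not from the question) could be plugged into agg like this:

```python
import numpy as np
import pandas as pd

M1 = np.arange(25).reshape(5, 5)
st = pd.DataFrame(M1).stack().to_frame('M1')
st['X'] = np.array([1, 1, 1, 2, 2])[st.index.get_level_values(0)]
st['Y'] = np.array([3, 4, 4, 4, 5])[st.index.get_level_values(1)]

# p can be any function mapping a 1D array/Series to a scalar,
# e.g. the spread (max - min) of each block:
p = lambda a: a.max() - a.min()
stp = st.groupby(['X', 'Y']).agg(p)
```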

Finally, the unstack() method will convert the DataFrame back into the properly 2D "matrix form", and then if you want you can use to_numpy() (formerly as_matrix(), which was removed in Pandas 1.0) to turn it back into a pure Numpy array.

>>> M3 = stp.unstack().to_numpy()
>>> M3
array([[ 15,  63,  27],
       [ 35, 117,  43]])

If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example. However, I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that holding an additional 2 or 3 copies of them would exhaust the available memory.
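For reductions that have a ufunc counterpart (sum via np.add, max via np.maximum), a pure-NumPy sketch without Pandas is also possible; the labels below are the same made-up ones as in the example above:

```python
import numpy as np

M1 = np.arange(25).reshape(5, 5)
X = np.array([1, 1, 1, 2, 2])   # f applied to the row indices
Y = np.array([3, 4, 4, 4, 5])   # g applied to the column indices

# Relabel the groups as 0-based block indices.
xi = np.unique(X, return_inverse=True)[1]
yi = np.unique(Y, return_inverse=True)[1]

# Scatter-add every element of M1 into its (X, Y) block; the index
# arrays broadcast to the shape of M1. np.maximum.at works
# analogously for p = max (start from -inf instead of 0).
M3 = np.zeros((xi.max() + 1, yi.max() + 1))
np.add.at(M3, (xi[:, None], yi[None, :]), M1)
```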


6 Comments

I would describe vectorize as an easy way, not a quick way (which implies a speed improvement).
Your f(x) can be written as np.array([1,1,1,2,2])[x], and will run a lot faster than the vectorized version.
This is the first time I've seen @vectorize used as a decorator. It does work, though it doesn't allow parameters like otypes. Often when posters have problems with vectorize (other than speed expectations) it's because they need one of the extra parameters.
@hpaulj It's quick in the sense of the amount of time you need to spend writing the code. I'm well aware that vectorize is not quick to execute, but speed is not a concern here. Anyway, thanks for the suggestion.
It should be quick in running, not coding. I think the main problem here is that scipy does not support sparse matrices with more than two dimensions. Otherwise you could always have one more dimension for each group (4D) and use a function like max() to reduce the dimensions for the members of the groups.

Let M1 be a numpy n x m array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:

xs = sorted(set(f(i) for i in range(n)))
ys = sorted(set(g(i) for i in range(m)))

To build a block matrix for each X,Y you can use numpy boolean indexing along with the grid-construction helper ix_ to select only rows and columns belonging to X and Y, respectively. Finally, apply p to the selected submatrix:

from numpy import zeros, arange, ix_

ii, jj = arange(n), arange(m)
M3 = zeros((len(xs), len(ys)))

for k, X in enumerate(xs):
    for l, Y in enumerate(ys):
        M3[k,l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])

The partition functions f and g have to apply element-wise to numpy arrays for this to work. As mentioned in the other answer, the numpy.vectorize decorator can be used to achieve this.
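A minimal sketch of that vectorize pattern (a toy f, not the one from the example below):

```python
import numpy as np

# np.vectorize turns a scalar function into one that maps
# element-wise over arrays (convenient, though not fast).
@np.vectorize
def f(x):
    return x // 3

f(np.arange(5))   # -> array([0, 0, 0, 1, 1])
```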

To give an example:

import numpy as np

n = m = 5
M1 = np.arange(25).reshape(5, 5)
f = lambda x: x // 3        # f(ii) = [0, 0, 0, 1, 1]
g = lambda x: (x + 2) // 3  # g(jj) = [0, 1, 1, 1, 2]
p = np.sum

M3 = [[  15.,   63.,   27.],
      [  35.,  117.,   43.]]

2 Comments

The problem is that we still have two python loops. I was hoping to do that in C by using numpy...
I don't think that it can be done without loops. An alternative would be to have a 4D array indexed by (X, Y, x, y) and use apply_over_axes(p, a, (2,3)), but that just shifts the loops to the array construction (and is less powerful and constructs M2 explicitly). The loops are only over the partitions, anyway. As long as each contains more than just a handful of elements, the bulk of the computation should go into the evaluation of p.
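The 4D idea from the comment above can be sketched in the special case where the partition groups happen to be contiguous, equal-size blocks (an assumption not given in the question):

```python
import numpy as np

# Hypothetical setup: a 6x6 matrix whose row groups are 3 rows each
# and whose column groups are 2 columns each. Then M2 can be viewed
# as a 4D array (row-block, row-in-block, col-block, col-in-block)
# and reduced over the within-block axes without any Python loop.
M1 = np.arange(36).reshape(6, 6)
bx, by = 3, 2                      # block height and width
M3 = M1.reshape(6 // bx, bx, 6 // by, by).max(axis=(1, 3))
```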

I encountered the same problem some years later, and in my opinion the best solution is as follows:

M2 = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        # x and y are the row and column index arrays of M1
        M2[i, j] = p(M1[f(x) == i, :][:, g(y) == j])

This assumes that f takes values in {0, 1, ..., n-1} and that g takes values in {0, 1, ..., m-1}.

An example would be

import numpy as np

M1 = np.random.random((4,6))

print(M1)

x = range(4)
y = range(6)
p = np.sum


def f(x):
    return np.array([0,0,1,2])[x]

def g(y):
    return np.array([0,1,1,0,1,0])[y]

n = 3 # number of elements in partition f
m = 2 # number of elements in partition g


M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][: , g(y) == j])


print(M2)

To determine n and m automatically, you can use len(set(f(x))) and len(set(g(y))).

