I am trying to calculate a vectorised nested sum
(so effectively doing a separate calculation for each row k)
The fastest way I have come up with is to define a lower triangular matrix of ones to take into account the range on the inner sum
O = np.tril(np.ones((N,N),uint8),-1)
and then evaluate the inner sum with range j=1..N, allowing the use of a dot product
np.einsum('ij,ij->i',V,np.dot(W,O))
This works well, but even with uint8 data type, O requires a lot of memory for N>>10000. Is there a better way using standard numpy/scipy functions? My plan is to run this on a GPU using cupy.
cumsum