
Basically, I have a function that, for each row, sums the values above it one at a time until the sum reaches a given threshold. Once the threshold is reached, it takes the resulting slice indices and uses them to return the mean of the corresponding slice of another column.

import numpy as np

#Random data:
values = np.random.uniform(0,10,300000)
values2 = np.random.uniform(0,10,300000)
output = [0]*len(values)

#Function that operates on a single row and returns the mean
def function(threshold,row):
    slice_sum=0
    i=1
    while slice_sum < threshold:
        slice_sum = values[row-i:row].sum()
        i=i+1        
    mean = values2[row-i:row].mean()
    return mean


#Loop to iterate the function row by row:
for i in range(15,len(values)): #let's just skip the first 15 rows, otherwise the loop might get stuck. This issue is not a priority though.
    output[i] = function(40,i)

This is a simplified version of the loop. It might not look slow, but in practice it is very slow. So I'm wondering if there's a faster way of achieving this without a for loop.

Thanks

  • Just to simplify the question: you don't need to preallocate output to a particular length. Just use output = [function(40, i) for i in range(15, len(values))]. Commented Mar 9, 2020 at 17:19

2 Answers


You don't need to recompute the sum on each pass through the loop. You start with values[row-1:row] (a single value), and if that is still below the threshold, you add one more value. Rather than re-summing the same values iteration after iteration, just add the next value to the running sum.

def function(threshold, row):
    slice_sum = 0
    for i in range(1, len(values)+1):
        slice_sum += values[row-i]          # add just the next value to the running sum
        if slice_sum >= threshold:
            break
    return values2[row-i-1:row].mean()      # same slice the original function ends up using

Per call, this reduces the number of addition operations from O(k^2) to O(k), where k is the number of values needed to reach the threshold.
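
For a rough sense of the speed-up, here is a minimal timing sketch. It reuses the values, values2 arrays and the threshold of 40 from the question; time.perf_counter and the 10,000-row sample size are just illustrative choices.

import time

# Same call pattern as the question's loop, limited to a sample of rows for timing.
start = time.perf_counter()
sample = [function(40, row) for row in range(15, 10_000)]
print(f"{len(sample)} rows in {time.perf_counter() - start:.2f} s")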



Use searchsorted on a cumulative sum of values to navigate directly to the next group. The cumulative sum costs O(N) for an array of length N, and each group boundary is then found with a single O(log N) binary search, so the overall cost is roughly O(N + G*log N), where G is the number of groups in values:

import numpy as np

def meanBlocks(values,values2,threshold):
    sums = np.cumsum(values)
    i = j = k = 0
    output = np.zeros(values.size)
    while j < values.size:
        s = sums[j]-values[j]+threshold     # s is next cumsum to reach
        i,j = j,np.searchsorted(sums,s)     # position of next increment by threshold 
        output[k] = np.mean(values2[i:j])   # track mean of values2 for range
        k += 1
    return output[:k]

outputs:

values  = np.arange(10)
values2 = np.arange(10)*5
print(values)
print(values2)
print(meanBlocks(values,values2,13))

[0 1 2 3 4 5 6 7 8 9]           #   (0+1+2+3+4)    (5+6)     (7)   ...   
[ 0  5 10 15 20 25 30 35 40 45] # (0,5,10,15,20)  (25,30)    (35)  ...
[10.  27.5 35.  40.  45. ]      #   50/5 = 10    55/2=27.5    35   ...


print("")
values    = np.random.uniform(0,10,300000)
values2   = np.random.uniform(0,10,300000)
print(values)
print(values2)
print(meanBlocks(values,values2,40)) # takes 0.43 sec on my laptop

[6.79333765 2.22880971 1.37706989 ... 8.75649835 2.92422716 5.1280224 ]
[3.56901367 0.15243962 6.76291706 ... 4.47662928 2.61969948 8.0941208 ]
[4.88477774 3.87464821 5.42599828 ... 4.47055786 4.48768768 5.17582407]
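
For reference, a small sketch that re-derives the group boundaries for the arange(10) example above, using the same cumsum/searchsorted logic as meanBlocks (the standalone variables below are illustrative only): each group's sum stays below the threshold, and the element that would tip it over starts the next group.

import numpy as np

# Illustrative re-derivation of the group boundaries for the small example above.
values = np.arange(10)
threshold = 13
sums = np.cumsum(values)
starts = [0]
while True:
    s = sums[starts[-1]] - values[starts[-1]] + threshold   # next cumsum to reach
    j = int(np.searchsorted(sums, s))
    if j >= values.size:
        break
    starts.append(j)
bounds = list(zip(starts, starts[1:] + [values.size]))
print(bounds)                                      # [(0, 5), (5, 7), (7, 8), (8, 9), (9, 10)]
print([int(values[a:b].sum()) for a, b in bounds]) # [10, 11, 7, 8, 9] -- each below the threshold of 13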

