I think asking how to implement it just using np.where is a bit of an X/Y problem.
So I'll try to explain how I would approach optimizing this function.
My first instinct is to get rid of the for loop, which was the pain point anyway:
import numpy as np
from scipy.stats import logistic
def func1(y, X, thresholds):
ll = 0.0
for row in zip(y, X):
if row[0] == 0:
ll += logistic.logcdf(thresholds[0] - row[1])
elif row[0] == len(thresholds):
ll += logistic.logcdf(row[1] - thresholds[-1])
else:
diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
logistic.cdf(thresholds[row[0] - 1] - row[1])
diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
ll += np.log(diff_prob)
return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))
I have just replaced i with row[0], without changing the semantics of the loop. So that's one for loop less.
Now I would like to have the form of the statements in the different branches of the if-else to be the same. To that end:
import numpy as np
from scipy.stats import logistic
def func2(y, X, thresholds):
ll = 0.0
for row in zip(y, X):
if row[0] == 0:
ll += logistic.logcdf(thresholds[0] - row[1])
elif row[0] == len(thresholds):
ll += logistic.logcdf(row[1] - thresholds[-1])
else:
ll += np.log(
np.maximum(
10 ** -5,
logistic.cdf(thresholds[row[0]] - row[1]) -
logistic.cdf(thresholds[row[0] - 1] - row[1])
)
)
return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))
Now the expression in each branch is of the form ll += expr.
At this piont there are a couple of different paths the optimization can take. You can try to optimize the loop away by writing it as a comprehension, but I suspect that it'll not give you much increase in speed.
An alternate path is to pull the if conditions out of the loop. That is what your intent with np.where was as well:
import numpy as np
from scipy.stats import logistic
def func3(y, X, thresholds):
y_0 = y == 0
y_end = y == len(thresholds)
y_rest = ~(y_0 | y_end)
ll_1 = logistic.logcdf(thresholds[0] - X[ y_0 ])
ll_2 = logistic.logcdf(X[ y_end ] - thresholds[-1])
ll_3 = np.log(
np.maximum(
10 ** -5,
logistic.cdf(thresholds[y[ y_rest ]] - X[ y_rest ]) -
logistic.cdf(thresholds[ y[y_rest] - 1 ] - X[ y_rest])
)
)
return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)
y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))
Note that I turned X into an np.array to be able to use fancy indexing on it.
At this point, I'd wager that it is fast enough for my purposes. However, you can stop earlier or beyond this point, depending on your requirements.
On my computer, I get the following results:
y = np.random.random_integers(0, 10, size=(10000,))
X = np.random.random_integers(0, 10, size=(10000,))
thresholds = np.cumsum(np.random.rand(10))
%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop
%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop
%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop
%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop