
I am translating a numerical method that I wrote in MATLAB to Python. For some reason the Python code, which is almost identical, runs considerably slower. Here U and V are the unknowns, which are solved for at every timestep. The size of U[:, n] and V[:, n] is 700x1. The rest of the variables (dt, A, and denom) are constants. Here is the loop (numpy has been imported with "from numpy import *"):

for n in range(0, 400):
    # nonlinear term, transformed to Fourier space
    UnVn2 = fft.fft(U[:, n] * V[:, n] ** 3)
    # update each unknown in Fourier space, then transform back
    U[:, n + 1] = fft.ifft((fft.fft(U[:, n]) / dt - UnVn2 + A) / denom)
    V[:, n + 1] = fft.ifft((fft.fft(V[:, n]) / dt + UnVn2) / denom)

Any suggestions? Thanks greatly.

  • What Python runtime? Is it using a JIT compiler? Are the BLAS and FFT routines as fast as the ones packaged with MATLAB? IIRC, MATLAB uses Intel and AMD binaries which are very heavily optimized. Commented Jan 10, 2013 at 22:23
  • I am still very new to Python, but I am running the code just by importing the file into IDLE. Is that called JIT? The Python runtime seems to be at least 3 times that of MATLAB. I had assumed that the numpy routines would be fully optimized, since numpy is so commonly used for scientific computing. Commented Jan 10, 2013 at 22:32
  • 1
    numpy is capable of using those optimized routines. See software.intel.com/en-us/articles/numpyscipy-with-intel-mkl But use the optimized library most appropriate for your processor. Commented Jan 10, 2013 at 22:36
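For reference, a quick way to check which low-level libraries a given numpy build is linked against (the exact output format varies between numpy versions):

import numpy as np

# Lists the BLAS/LAPACK (and, in MKL builds, FFT) libraries this numpy
# installation was compiled against; look for "mkl" in the output.
np.show_config()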

3 Answers


See the Intel MKL article linked in the comments above for instructions on making Python and numpy use the same accelerated FFT routines that ship with MATLAB.

If you have an AMD processor, see these instructions instead.
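If rebuilding numpy against a vendor library is not an option, a drop-in FFTW wrapper can serve a similar purpose. The sketch below assumes the third-party pyfftw package (not mentioned in this answer) and its numpy-compatible interface:

import numpy as np
import pyfftw

pyfftw.interfaces.cache.enable()  # cache FFTW plans so repeated calls stay fast
from pyfftw.interfaces.numpy_fft import fft, ifft  # numpy-compatible signatures

x = np.random.rand(700)
X = fft(x)                         # same call signature as np.fft.fft(x)
print(np.allclose(ifft(X).real, x))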




I am not sure why Python is slower than MATLAB, but...

The FFT is a linear transform, and that linearity renders most (if not all) of your FFT operations unnecessary: since dt and denom are scalar constants, fft(U) / dt - fft(U * V**3) is simply the FFT of U / dt - U * V**3, and applying ifft undoes the fft. The only subtlety is the constant A: adding a constant to every Fourier coefficient comes back, after the inverse transform, as an addition to element 0 only.
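To illustrate that last property, here is a quick standalone check:

import numpy as np

x = np.random.rand(8)
A = 2.5
y = np.fft.ifft(np.fft.fft(x) + A).real

print(np.allclose(y[1:], x[1:]))   # True: elements 1..N-1 are unchanged
print(np.isclose(y[0], x[0] + A))  # True: the constant lands on element 0

With that in mind, func2 below performs the same update as func1 without any transforms: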

import numpy as np

def func1(U, V, dt, denom, A):
    # original update: a forward and inverse FFT around every operation
    UnVn2 = np.fft.fft(U * V**3)
    U_ = np.fft.ifft((np.fft.fft(U) / dt - UnVn2 + A) / denom)
    V_ = np.fft.ifft((np.fft.fft(V) / dt + UnVn2) / denom)
    return np.vstack((U_, V_))

def func2(U, V, dt, denom, A):
    # the same update with the transforms cancelled out
    UnVn2 = U * V**3
    U_ = (U / dt - UnVn2) / denom
    U_[0] += A / denom   # the constant A only shifts element 0
    V_ = (V / dt + UnVn2) / denom
    return np.vstack((U_, V_))

U = np.random.rand(700)
V = np.random.rand(700)
dt, denom, A = tuple(np.random.rand(3))

>>> func1(U, V, dt, denom, A)
array([[ 2.35201751 -1.11022302e-16j,  0.81099082 -2.45463372e-16j,
         0.48451858 +2.15658782e-18j, ...,  2.23237712 -5.24753851e-16j,
         1.15264205 -2.31140087e-16j,  1.06670009 +1.28369537e-16j],
       [ 2.89314136 +8.67361738e-17j,  3.65612404 -7.80625564e-17j,
         3.31383830 +8.96916836e-17j, ...,  0.90415910 +6.27969898e-16j,
         3.03505664 +4.72358723e-16j,  0.64669863 +4.99600361e-16j]])
>>> func2(U, V, dt, denom, A)
array([[ 2.35201751,  0.81099082,  0.48451858, ...,  2.23237712,
         1.15264205,  1.06670009],
       [ 2.89314136,  3.65612404,  3.3138383 , ...,  0.9041591 ,
         3.03505664,  0.64669863]])
>>> np.max(np.abs(func1(U, V, dt, denom, A) - func2(U, V, dt, denom, A)))
1.5151595604785605e-15

And of course:

>>> import timeit
>>> timeit.timeit('func1(U, V, dt, denom, A)', 'from __main__ import func1, U, V, dt, denom, A', number=400)
0.14169366197616284
>>> timeit.timeit('func2(U, V, dt, denom, A)', 'from __main__ import func2, U, V, dt, denom, A', number=400)
0.06098524703428154

I have to admit that is less than I was expecting (presumably because at this size the FFTs themselves are cheap and fixed per-call overhead dominates), but it is still about 2.3x faster.

EDIT The speed-up from not doing FFTs seemed too small, so I modified func1 and func2 to return a tuple (U_, V_) and ran the following code:

from time import clock
U = np.zeros((700, 400), dtype=np.float)
V = np.zeros((700, 400), dtype=np.float)
U[:, 0] = np.random.rand(700)
V[:, 0] = np.random.rand(700)
dt, denom, A = tuple(np.random.rand(3))
t = clock()
for j in xrange(399):   # time the FFT-based version
    U[:, j+1], V[:, j+1] = func1(U[:, j], V[:, j], dt, denom, A)
print clock() - t
t = clock()
for j in xrange(399):   # time the transform-free version
    U[:, j+1], V[:, j+1] = func2(U[:, j], V[:, j], dt, denom, A)
print clock() - t

The printed output was 11.5148652438 and 0.321673111194, so the speed-up in the actual problem setting is more like 30x.

I also timed pwuertz's proposal of transposing the array layout, with no significant improvement: 11.1805414552 and 0.297830755317 for the following code:

U = np.zeros((400, 700), dtype=np.float)
V = np.zeros((400, 700), dtype=np.float)
U[0] = np.random.rand(700)
V[0] = np.random.rand(700)
dt, denom, A = tuple(np.random.rand(3))
t = clock()
for j in xrange(399):   # FFT-based version, row-major layout
    U[j+1], V[j+1] = func1(U[j], V[j], dt, denom, A)
print clock() - t
t = clock()
for j in xrange(399):   # transform-free version, row-major layout
    U[j+1], V[j+1] = func2(U[j], V[j], dt, denom, A)
print clock() - t

It does look much, much neater, though.

1 Comment

Getting rid of the FFT is certainly a big improvement. The impact of memory alignment, however, depends on the size of the array and the hardware. On my system I'm seeing an improvement of ~30% when switching the axes, and a factor of 2 for larger arrays (2000x2000). It's just general advice to keep things like this in mind ;)

I'm not sure how MATLAB organizes the axes in multi-dimensional arrays, but I'm pretty sure numpy uses C-like row-major order (edit: Wikipedia even mentions that MATLAB uses column-major order ;) ).

Since you're operating on single columns, all of your operations must stride across rows, which with row-major ordering is generally less efficient than iterating over contiguous rows. Consider transposing the layout of your 2D arrays; you should see a noticeable increase in performance.
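A minimal illustration of the layout difference, using the array shape from the question:

import numpy as np

a = np.zeros((700, 400))   # C (row-major) order: rows are contiguous
print(a[0].strides)        # (8,)    -> adjacent elements are adjacent in memory
print(a[:, 0].strides)     # (3200,) -> each step jumps over a whole 400-element row
print(a[0].flags['C_CONTIGUOUS'])     # True
print(a[:, 0].flags['C_CONTIGUOUS'])  # False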

