
I have recently been trying to improve the performance (processing time, that is) of a piece of code I wrote in Python 3.5 (running on Ubuntu 16.04). My code performs a cosine Fourier transform, and I ultimately perform it many times, so it takes many, many hours...

My laptop is a bit old, so I am not convinced multi-threading would help. Anyway, I am more interested in speeding up the calculation itself. Here is the code with my attempt at improving things.

import numpy as np
import time
import math


#== Define my two large numpy arrays ==#
a = np.arange( 200000 )
b = np.arange( 200000 )


#===============#
#== First way ==#
#===============#

t1 = time.time()

#== Loop that performs 1D array calculation 50 times sequentially ==#
for i in range(0, 50):
    a * np.cos( 2 * math.pi * i * b )

t2 = time.time()
print( '\nLoop computation with 1D arrays: ', (t2-t1)*1000, ' ms' )


#================#
#== Second way ==#
#================#

t1 = time.time()

#== One liner to use 1D and 2D arrays at once ==#
a * np.cos( 2 * math.pi * ( np.arange( 50 ) )[:, None] * b )

t2 = time.time()
print( '\nOne liner using both 1D and 2D arrays at once: ', (t2-t1)*1000, ' ms\n' )

In this case I need to perform the calculation 50 times, with large Numpy arrays. I used to do the 1D array calculation with a loop, sequentially, as many times as needed.

More recently I tried to use the power of Numpy vectorization to do the whole thing in one go with a 2D array calculation. It turns out the 2D array calculation takes more time, as the output shows:

Loop computation with 1D arrays:  354.66670989990234  ms

One liner using both 1D and 2D arrays at once:  414.03937339782715  ms

I did not expect that. Maybe, given the large arrays, the memory overhead slows down the calculation? Or is my laptop's CPU simply a bit more overwhelmed?

So my question is: what is the most performant/fastest way to proceed in this specific case?

UPDATE: I tried Subhaneil Lahiri's Numba suggestion, adding the following lines of code to call it twice (still not storing any results):

#===============#
#== Third way ==#
#===============#

import numba as nb

t1 = time.time()

@nb.jit(cache=True)
def cos_matrix(a, b, niter):
    for i in range(niter):
        a * np.cos(2 * math.pi * i * b)

cos_matrix(a, b, 50)

t2 = time.time()
print( '\nLoop computation using Numba and 1D arrays: ', (t2-t1)*1000, ' ms' )

t1 = time.time()

cos_matrix(a, b, 50)

t2 = time.time()
print( '\nSecond call to loop computation using Numba and 1D arrays: ', (t2-t1)*1000, ' ms\n' )

And unfortunately it does not improve the result, as you can see:

Loop computation with 1D arrays:  366.67585372924805  ms

One liner using both 1D and 2D arrays at once:  417.5834655761719  ms

Loop computation using Numba and 1D arrays:  590.1947021484375  ms

Second call to loop computation using Numba and 1D arrays:  458.58097076416016  ms

Thanks a lot in advance, Antoine.

Comments
  • In the first case, it looks like you aren't storing the result in the for loop. The difference could come from temporarily allocating memory in the second case. Commented Mar 18, 2019 at 19:58
  • For this example I indeed did not store the results, in order to compare only the computation time. Memory allocation overhead in the second case does seem like a good explanation. Why is memory allocated in the second case? Isn't all the computation magically parallelized? And is there a way to improve it? Thanks! Commented Mar 18, 2019 at 20:03
  • All this is happening at the Python level. When you multiply the two vectors inside the cosine, the result is stored in a matrix that is passed to np.cos; the result of that is also stored in a matrix that is then discarded. Vectorization moves the loop from Python to C, but the results of intermediate steps are still passed back to Python, which requires storage (a sketch of how to avoid those temporaries follows these comments). Commented Mar 18, 2019 at 20:25
  • A few iterations of a relatively expensive operation (50 vs. 200000) are often competitive with one larger operation (50*200000). I think there's a trade-off between iteration costs and memory-management costs. Commented Mar 18, 2019 at 20:25
  • Anyway, the point I was trying to make is that in realistic situations you'd probably want to store the result so that you can do something with it. In that case the speed of the for loop could be misleading. Commented Mar 18, 2019 at 20:28
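
Regarding the temporaries discussed above, here is a minimal sketch of how the intermediate allocations could be avoided by reusing preallocated buffers through the out= argument of Numpy's ufuncs (the buffer names are illustrative, not from the original post):

import numpy as np

a = np.arange(200000, dtype=np.float64)
b = np.arange(200000, dtype=np.float64)

tmp = np.empty_like(b)  # reusable scratch buffer
out = np.empty_like(b)  # reusable result buffer
for i in range(50):
    np.multiply(2 * np.pi * i, b, out=tmp)  # tmp = 2*pi*i*b, no fresh allocation
    np.cos(tmp, out=tmp)                    # tmp = cos(tmp), computed in place
    np.multiply(a, tmp, out=out)            # out = a * cos(2*pi*i*b)
    # ... use out here ...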

2 Answers


First, think about your input and output datatypes. I assume that you want to do the calculation in double precision (float64), but single precision (float32) would be faster.
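
If single precision is acceptable for your application, the switch is a one-line cast (a sketch; whether float32 accuracy suffices is for you to judge):

import numpy as np

# float32 inputs: np.cos then also returns float32, which is typically
# faster and halves the memory traffic
a = np.arange(200000, dtype=np.float32)
b = np.arange(200000, dtype=np.float32)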

The second thing to consider is the implementation of the cosine function itself. By default, Python uses whatever implementation it is linked against. In this example I use the Intel SVML implementation. You may have to install it first, as described in the link.
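
To verify that Numba actually picked up SVML, its system report (numba -s on the command line) lists the SVML state; from Python, I believe the following works, though treat the attribute name as an assumption on my part:

import numba

# should print True if Numba detected the SVML library (attribute name assumed)
print(numba.config.USING_SVML)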

Please also consider that it simply makes no sense to benchmark a function without using its output. A compiler like Numba may optimize away the calculation you are trying to measure, and printing the array to the terminal can itself take a significant amount of time.
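
For example (my own illustration, not part of the original answer), consuming the result through a checksum keeps the computation alive and keeps printing out of the timed region:

import numpy as np
import math
import time

a = np.arange(200000.0)
b = np.arange(200000.0)

t1 = time.time()
res = np.empty((50, a.shape[0]))
for i in range(50):
    res[i, :] = a * np.cos(2 * math.pi * i * b)
t2 = time.time()

# using res after timing prevents the work from being optimized away
print('time:', (t2 - t1) * 1000, 'ms, checksum:', res.sum())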

Code

import numpy as np
import time
import math
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def compute_numba(a, b, it):
  # allocate the output once; row i holds the result of iteration i
  res = np.empty((it, a.shape[0]))
  ita = np.arange(0, it)

  # prange parallelizes the outer loop across CPU cores
  for i in nb.prange(ita.shape[0]):
    k = ita[i]  # the current iteration index
    for j in range(a.shape[0]):
      res[i, j] = a[j] * np.cos(2. * np.pi * k * b[j])
  return res

#== Define my two large numpy arrays ==#
#Your input type may be float64?
a = np.arange(200000).astype(np.float64)
b = np.arange(200000).astype(np.float64)


#===============#
#== First way ==#
#===============#

t1 = time.time()

#== Loop that performs 1D array calculation 50 times sequentially ==#
res=np.empty((50,a.shape[0]))
for i in range(0, 50):
    res[i,:]=a * np.cos( 2 * math.pi * i * b )

t2 = time.time()
print( '\nLoop computation with 1D arrays: ', (t2-t1)*1000, ' ms' )


#================#
#== Second way ==#
#================#

t1 = time.time()

#== One liner to use 1D and 2D arrays at once ==#
res=a * np.cos( 2 * math.pi * ( np.arange( 50 ) )[:, None] * b )

t2 = time.time()
print( '\nOne liner using both 1D and 2D arrays at once: ', (t2-t1)*1000, ' ms\n' )

#===============#
#== Third way ==#
#===============#
#Don't measure compilation overhead (you will call this function multiple times?)
res=compute_numba(a,b,50)

t1 = time.time()
res=compute_numba(a,b,50)
t2 = time.time()
print( '\nLoop computation with Numba: ', (t2-t1)*1000, ' ms' )

Output

Core i5-8500

Loop computation with 1D arrays:  176.4671802520752  ms
One liner using both 1D and 2D arrays at once:  151.40032768249512  ms
Loop computation with Numba:  26.036739349365234  ms

1 Comment

This looks like an excellent suggestion. It will take me a bit of time to try it with my full implementation, because I need to migrate the bash scripts that run my Python code to Python scripts, so that I don't have to recompile the function every time I run the code (currently the Python code is run many times from bash). I will keep you posted. Thanks!

There are a few tools for speeding up for loops. I think Numba is the easiest to use. I've heard that Cython is the most effective but also harder to use, though I haven't tried it myself. In the extreme case you can write a C extension.

Numba: http://numba.pydata.org
Cython: https://cython.org

Example of numba:

import math
import numpy as np
import numba as nb

@nb.jit(cache=True)
def cos_matrix(a, b, niter):
    for i in range(niter):
        c = a * np.cos(2 * math.pi * i * b)
        # do something with c...
    return c

This generates & compiles C code the first time it's called.

Edit: not C code, LLVM-IR code, as pointed out by @max9111
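
Since compilation happens on the first call of each interpreter session, a warm-up call combined with cache=True keeps the compile cost out of your timings; as far as I understand the Numba docs, cache=True also persists the compiled code to disk so later runs can skip recompilation. A sketch that stores its result so nothing gets optimized away:

import math
import numpy as np
import numba as nb

@nb.njit(cache=True)
def cos_matrix(a, b, niter):
    res = np.empty((niter, a.shape[0]))
    for i in range(niter):
        res[i, :] = a * np.cos(2 * math.pi * i * b)
    return res

a = np.arange(200000.0)
b = np.arange(200000.0)
cos_matrix(a, b, 2)         # warm-up: compiles here, or loads the on-disk cache
res = cos_matrix(a, b, 50)  # subsequent calls run the cached machine code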

5 Comments

This is a "link-only" answer.
Better now? Any suggestions?
I tried your suggestion, thanks! You can see the edit in my first post. Unfortunately, the way I used Numba does not seem to improve the calculation speed :-/
Numba does not generate C code. It generates LLVM-IR code like Clang does, and compiles it to native code using the LLVM backend (also like Clang).
One idea I am having and trying out: doing an approximate cosine calculation by hand, to high enough accuracy, hoping for a speed increase versus numpy.cos (see the sketch below).
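
Following up on that last comment, a minimal sketch of one way to cut the number of cosine evaluations down to one, using the exact recurrence cos((i+1)t) = 2*cos(t)*cos(i*t) - cos((i-1)*t) rather than an approximation (names are illustrative; rounding error accumulates slightly over the 50 steps):

import numpy as np

a = np.arange(200000, dtype=np.float64)
b = np.arange(200000, dtype=np.float64)

theta = 2 * np.pi * b
c1 = np.cos(theta)        # the only call to np.cos
c_prev = np.ones_like(b)  # cos(0 * theta)
c_curr = c1.copy()        # cos(1 * theta)

res = np.empty((50, b.shape[0]))
res[0] = a * c_prev
res[1] = a * c_curr
for i in range(2, 50):
    # cos(i*theta) = 2*cos(theta)*cos((i-1)*theta) - cos((i-2)*theta)
    c_prev, c_curr = c_curr, 2 * c1 * c_curr - c_prev
    res[i] = a * c_curr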
