I am currently trying to parallelize parts of a Python code using the ray module. Unfortunately, ray does not allow modifying data in shared memory by default (at least as far as I understand). This means I would first need to perform a numpy.copy(), which sounds very inefficient to me.

Here is a (probably very inefficient) example:

import numpy as np
import ray

@ray.remote
def mod_arr(arr):
    arr_cp = np.copy(arr)
    arr_cp += np.ones(arr_cp.shape)
    return arr_cp

ray.init()
arr = np.zeros((2, 3, 4))
arr = ray.get(mod_arr.remote(arr))

If I omit the np.copy() call in mod_arr() and try to modify arr in place instead, I get the following error:

ValueError: output array is read-only

Am I using ray completely wrong, or is it not the correct tool for my purpose?
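As a side note, the same ValueError can be reproduced without ray at all: numpy tracks an array's read-only state through its writeable flag, so flipping that flag by hand (a stand-in here for an array ray has placed in shared memory) triggers the identical error and shows why the copy is needed:

```python
import numpy as np

# Stand-in for an array ray has placed in shared memory: ray marks
# such arrays read-only, which numpy tracks via the writeable flag.
arr = np.zeros((2, 3, 4))
arr.setflags(write=False)

try:
    arr += 1  # in-place write fails, just like inside the remote task
except ValueError as err:
    print(err)  # output array is read-only

arr_cp = np.copy(arr)  # the copy is writeable again
arr_cp += 1
```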

1 Answer
Because of Python's GIL (global interpreter lock), multiple threads cannot run Python code in parallel. Therefore all true parallelism is achieved either outside of Python, when a module releases the GIL, or by using multiprocessing.

With multiprocessing, this kind of memory copy is normal. And not only there: in pure functional programming, where arguments to functions are immutable, the solution is always to copy memory when you have to. It brings a lot of stability advantages while paying an acceptable performance penalty.

Basically, treat these functions as pure functions.
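A minimal sketch of what "pure" means here, using plain numpy (no ray; mod_arr_pure is a hypothetical name): the function never touches its argument and returns freshly allocated data, which is exactly the pattern the copy in the question enforces:

```python
import numpy as np

def mod_arr_pure(arr):
    """Pure function: does not mutate its argument, returns new data."""
    return arr + 1  # numpy allocates a fresh result array here

a = np.zeros((2, 3, 4))
b = mod_arr_pure(a)
# a is untouched; only the returned array b carries the change
```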


2 Comments

Thanks for the insight, sounds reasonable (but what do you actually mean by "treat these functions as pure functions"?)
en.wikipedia.org/wiki/Pure_function Pay particular attention to the fact that they have no side effects: they don't modify any data anywhere, they only create new data.
