I am running a reinforcement learning simulation in Python that involves running many episodes and updating the agent's weights after each episode. To speed this up, I want to run several simulations in parallel threads, each consuming its own stream of random numbers, and update the weights in parallel.
The problem is that even on the experimental free-threaded (GIL-free) Python 3.13t, the main bottleneck is the random number generation itself: when I use N threads to run N simulations, the run takes N times as long, as if the random number generation were not parallelized at all.
Here is a simple program to illustrate this:
# test.py
import threading
import time
import random

def generate_randoms(n):
    """Generate n random floats between 0 and 1."""
    rng = random.Random()  # per-thread instance, no shared state
    for _ in range(n):
        rng.random()

def run_tasks_in_parallel(num_tasks, numbers):
    threads = []
    for i in range(num_tasks):
        t = threading.Thread(target=generate_randoms, args=(numbers[i],))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    num_tasks = 2
    numbers = [1000000] * num_tasks  # one workload per thread
    start_time = time.time()
    run_tasks_in_parallel(num_tasks, numbers)
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
Benchmark results (num_tasks=4 unless stated otherwise):
python3.13 test.py: ~11s
python3.13t test.py: ~6s
python3.13t -X gil=0 test.py: ~6s
python3.13t -X gil=1 test.py: ~16s
python3.13 test.py [num_tasks=1]: ~1s
python3.13t test.py [num_tasks=1]: ~1s
The program takes twice as long with num_tasks=2 as with num_tasks=1. Using numpy.random.default_rng instead is even slower. I initially suspected shared random state, so I instantiated a separate rng in each thread, as shown above, but the problem persisted. I have also ruled out a general threading issue: if I remove the random number generation, N threads do not take N times as long. The NumPy variant I tried is sketched below.
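For reference, this is roughly the NumPy version I benchmarked (structurally identical to test.py, just swapping in a per-thread Generator):

# test_numpy.py -- same structure as test.py, with numpy's default_rng
import threading
import time
import numpy as np

def generate_randoms(n):
    """Generate n random floats in [0, 1) with a thread-local Generator."""
    rng = np.random.default_rng()  # fresh, independent generator per thread
    for _ in range(n):
        rng.random()

def run_tasks_in_parallel(num_tasks, numbers):
    threads = [threading.Thread(target=generate_randoms, args=(numbers[i],))
               for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    num_tasks = 4
    numbers = [1000000] * num_tasks
    start_time = time.time()
    run_tasks_in_parallel(num_tasks, numbers)
    print(f"Time taken: {time.time() - start_time} seconds")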
Is there a way to parallelize random number generation in GIL-free Python? Why doesn't random number generation (with both CPython's random module and NumPy) scale with threads?
I do not need the threads to share random state (I would actually prefer that they don't), nor do I need thread-safe random number generation. I just need random numbers generated in parallel to speed up my training.
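To make "parallel, independent random numbers" concrete: a setup like the minimal sketch below, where each thread owns its own independently seeded generator (using NumPy's SeedSequence.spawn here purely as an illustration), would be perfectly acceptable for my purposes, as long as it actually ran in parallel:

import threading
import numpy as np

def worker(seed_seq, n):
    # Each thread owns its generator; nothing is shared across threads.
    rng = np.random.default_rng(seed_seq)
    for _ in range(n):
        rng.random()

# SeedSequence.spawn yields statistically independent child seeds.
children = np.random.SeedSequence(12345).spawn(4)
threads = [threading.Thread(target=worker, args=(ss, 1000000))
           for ss in children]
for t in threads:
    t.start()
for t in threads:
    t.join()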
For reference, sys._is_gil_enabled() returns False under python3.13t, and sysconfig.get_config_var('CONFIG_ARGS') confirms the interpreter was built with --disable-gil:

"'--prefix=~/.pyenv/versions/3.13.2t' '--enable-shared' '--libdir=~/.pyenv/versions/3.13.2t/lib' '--disable-gil' 'LDFLAGS=-L~/.pyenv/versions/3.13.2t/lib -Wl,-rpath,~/.pyenv/versions/3.13.2t/lib' 'LIBS=-L~/.pyenv/versions/3.13.2t/lib -Wl,-rpath,~/.pyenv/versions/3.13.2t/lib' 'CPPFLAGS=-I~/.pyenv/versions/3.13.2t/include'"