I’m running a Google Cloud Run service that exposes an API endpoint performing heavy computational tasks (CPU-bound). The API will be called very occasionally and for now there will never be concurrent requests. I want the execution as fast as possible.

I’ve configured the service with the maximum available CPU and memory settings (8 vCPU, 32 GB memory), but the overall execution time of the API has not improved compared to smaller configurations. From the metrics in Cloud Monitoring, both CPU utilization and memory usage remain consistently low (around 10–20%), even though the process should be CPU-intensive. This makes me suspect the container isn’t getting full CPU utilization or there might be throttling or configuration limiting performance.

Or I am missing something completely?

Expected behavior:

  • When given a larger CPU and memory configuration, the API process should execute faster and utilize more CPU resources proportionally (ideally 80–90% CPU utilization during heavy computation).

Current configuration:

  • Platform: Cloud Run (fully managed)
  • Concurrency: 1 (to isolate computation per request)
  • CPU allocation: 8 vCPUs
  • Memory allocation: 32 GB
  • Execution timeout: 60 minutes
  • Request duration: ~30–45 minutes
  • CPU utilization: 10–20%

Questions:
  • Is Cloud Run limiting CPU usage during request processing even when configured with max vCPUs?
  • Is there a better setup to ensure full CPU utilization for heavy computation workloads?
  • Are there recommended configurations or patterns for CPU-bound workloads that need to complete large computations under the 60-minute limit?

Additional information:

  • The code performs multi-threaded or parallelized computation in Python.
  • No I/O bottleneck observed.
  • Tested with different machine configurations — performance is mostly unchanged.

I'd be most grateful for pointers as to what I am doing wrong...

1 Answer

I ran into this myself.

The short version: Cloud Run is almost certainly not throttling you here — your code is. You’re basically running a single-core workload on an 8-core machine, so 10–20% CPU on 8 vCPU is exactly what you’d expect from ~1 core doing work.

If you set 8 vCPU and concurrency=1, when a request is running, that instance has up to 8 vCPUs available.

  • CPU utilization in the metrics is reported as a fraction of total vCPU. So:

    • A fully busy single core on an 8-vCPU instance ≈ 12.5% CPU.

    • Two fully busy cores ≈ 25% CPU, etc.
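A quick sanity check of that arithmetic (the helper name is just for illustration):

```python
def expected_utilization_pct(busy_cores: float, total_vcpus: int) -> float:
    """Utilization as Cloud Monitoring reports it: busy cores / total vCPUs."""
    return 100.0 * busy_cores / total_vcpus

print(expected_utilization_pct(1, 8))  # one pegged core on 8 vCPU -> 12.5
print(expected_utilization_pct(2, 8))  # two pegged cores -> 25.0
```

So your observed 10–20% is consistent with roughly one busy core, plus a little overhead.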

More vCPUs don’t help a single-threaded (or GIL-limited) Python workload

This is the key point you already suspected:

  • Most plain Python CPU-bound code runs in one OS thread.

  • The Python GIL ensures that only one thread executes Python bytecode at a time, even on multi-core CPUs.

  • So if your “parallelism” is based on:

    • threading.Thread

    • ThreadPoolExecutor

    • some frameworks that use threads under the hood

    …then your CPU-bound work still runs on effectively one core, no matter how many vCPUs the instance has.

Async (async/await) would not fix this either — it’s for I/O concurrency, not for speeding up CPU-bound work.
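To make the thread-vs-process distinction concrete, here's a minimal sketch (the function names are made up for illustration): both executors produce identical results, but the thread pool runs this pure-Python loop on one core at a time because of the GIL, while the process pool actually spreads it across cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n: int) -> int:
    """A purely CPU-bound task: no I/O, just Python bytecode."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(executor_cls, n_tasks: int = 4, n: int = 200_000) -> list:
    """Run the same tasks under the given executor class."""
    with executor_cls(max_workers=os.cpu_count()) as pool:
        return list(pool.map(burn, [n] * n_tasks))

if __name__ == "__main__":
    # Same answers either way; only the process version keeps multiple cores busy.
    threads = run_with(ThreadPoolExecutor)   # GIL-bound: ~1 core at 100%
    procs = run_with(ProcessPoolExecutor)    # true parallelism: ~N cores busy
    assert threads == procs
```

If you time the two versions on a multi-core machine, the thread pool takes roughly as long as running the tasks serially, while the process pool scales with core count (minus process startup and pickling overhead).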

Is the heavy computation actually parallel across multiple OS processes, or just Python threads in a single process?
(Python threads + CPU-bound work = still 1 core because of the GIL.)

  1. When you say “multi-threaded or parallelized,” what primitives are you using?

    • threading / concurrent.futures.ThreadPoolExecutor

    • multiprocessing / ProcessPoolExecutor

    • NumPy / Numba / other native extensions
      These behave very differently with respect to CPU usage.

  2. If you run exactly the same code locally on an 8-core machine, what does top / Task Manager show?

    • One thread pegged at 100% of one core?

    • Or 8 workers each at ~100% / N?

  3. What does CPU utilization look like with 1 vCPU vs 8 vCPU?

    • If 1 vCPU: do you see ~90–100% during compute?

    • If so, that confirms your workload is single-core and just doesn’t benefit from more vCPU.

  4. Have you profiled the code to verify it’s actually CPU-bound in pure Python, and not waiting on some internal I/O or locks?

  5. Is your Cloud Run service using “CPU always allocated” or only during requests?
    (For request-time workloads like yours this mostly affects between requests, not during.)

  6. How many worker processes does your app server run?

    • e.g. gunicorn workers / uvicorn workers / etc.

    • If there’s only 1 worker process handling the request, you will never use >1 core for that single request.

  7. Within that one request, are you starting extra worker processes at all?

    • e.g. multiprocessing.cpu_count() then ProcessPoolExecutor(max_workers=8).

  8. Does your algorithm itself parallelize cleanly?

    • If there are big serial sections (Amdahl’s law) you might not see much gain even with perfect multi-process usage.
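If the honest answer to these questions is "Python threads only", the usual fix is to shard the work across processes inside the request handler. A hedged sketch of that pattern (the `heavy_chunk` function and the shape of the data are assumptions about your workload, not anything Cloud Run requires):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def heavy_chunk(chunk: list) -> int:
    """Stand-in for one shard of your CPU-bound computation."""
    return sum(x * x for x in chunk)

def compute(data: list) -> int:
    """Split the input across one worker process per vCPU, then combine."""
    n_workers = os.cpu_count() or 1
    chunk_size = max(1, len(data) // n_workers)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(heavy_chunk, chunks))

if __name__ == "__main__":
    result = compute(list(range(1_000)))
    assert result == sum(x * x for x in range(1_000))
```

Inside the container, `os.cpu_count()` should reflect the vCPUs you configured; if it reports 1, something in your app server or base image is constraining the process. Also bear in mind Amdahl's law from point 8: chunking only helps the parallelizable part, so profile first to confirm where the 30–45 minutes actually go.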
