
I have a FastAPI application with multiple endpoints, and each endpoint uses certain memory-intensive objects (ML models). This works fine when I only have one worker, but I am worried about memory usage (and, to a lesser extent, startup time) when I scale to multiple workers.

Is there a way to limit certain workers to certain endpoints only? Then I would only load the objects required for the respective endpoint.

Specifically, assume I have two endpoints using 2 GB each. If I scale to four workers, I need 2 GB x 2 x 4 = 16 GB.

If the first two workers only serve the first endpoint and the other two only serve the second endpoint, every process needs to load just one of the models, so I would need 2 GB x 4 = 8 GB. This assumes of course that the load is approximately equal, which is the case here.
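For context, a minimal sketch of the kind of setup I have in mind (the load_model helper, the model names, and the endpoint paths are placeholders, not my actual code): because the models are loaded at module level, every worker process ends up holding both of them, which is where the 16 GB figure comes from.

```python
from fastapi import FastAPI

app = FastAPI()

def load_model(name: str) -> bytearray:
    # Placeholder for an expensive model load (~2 GB each in reality);
    # a tiny buffer here just so the sketch runs.
    return bytearray(1024)

# Module-level loads: every worker process imports this module,
# so every worker holds both models in memory.
model_a = load_model("model_a")
model_b = load_model("model_b")

@app.get("/a/{param}")
def endpoint_a(param: int):
    return {"result": model_a[param % len(model_a)]}

@app.get("/b/{param}")
def endpoint_b(param: int):
    return {"result": model_b[param % len(model_b)]}
```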

Alternatives:

  • One option would be a microservice architecture, where each endpoint is its own application. However, this question only came up because I am trying to move away from microservices due to reliability problems with that architecture (e.g. needing some kind of scheduler, needing to forward the HTTP endpoints, high latency from multiple layers of forwarding, and the fact that some endpoints are little more than return calculation(huge_object[param])).
  • The option to share the data among workers does not seem technically possible in the general case.
  • This might help. In case you would like to go with the first alternative option mentioned in your question (even for testing purposes), this answer could help you with forwarding requests. Commented Feb 20 at 12:31
  • You could run multiple gunicorn instances with a different number of workers each, and forward requests to each pool based on the path in your reverse proxy. You can also have a flag in your application that controls whether the models should be loaded, to keep the RAM usage low for the smaller workers. This can still be one application, just run in two different pools with different settings. Commented Feb 20 at 13:18
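To make the pooling idea from the last comment concrete, here is a rough sketch of one way it could look. The SERVE_MODELS environment variable, the load_model helper, and the paths are assumptions of mine, not anything FastAPI or gunicorn provide: each pool is started with a flag saying which models it should load, and the reverse proxy routes /a/... to one pool and /b/... to the other.

```python
import os

from fastapi import FastAPI, HTTPException

app = FastAPI()

def load_model(name: str) -> bytearray:
    # Stand-in for an expensive model load.
    return bytearray(1024)

# SERVE_MODELS is a hypothetical flag, e.g. "a", "b", or "a,b";
# each gunicorn pool would be started with a different value.
serve = set(os.environ.get("SERVE_MODELS", "a,b").split(","))
models = {name: load_model(name) for name in ("a", "b") if name in serve}

@app.get("/a/{param}")
def endpoint_a(param: int):
    if "a" not in models:
        # This pool was not configured for this endpoint; the reverse
        # proxy should normally never route /a/... here.
        raise HTTPException(status_code=503, detail="model 'a' not loaded in this pool")
    return {"result": models["a"][param % len(models["a"])]}

@app.get("/b/{param}")
def endpoint_b(param: int):
    if "b" not in models:
        raise HTTPException(status_code=503, detail="model 'b' not loaded in this pool")
    return {"result": models["b"][param % len(models["b"])]}
```

Each pool would then be started with its own flag value and bind address, e.g. SERVE_MODELS=a gunicorn -w 2 -k uvicorn.workers.UvicornWorker -b 127.0.0.1:8001 main:app (assuming the module is main.py), the same with SERVE_MODELS=b on another port, and the reverse proxy routing /a/ to the first port and /b/ to the second.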
