
I'm trying to implement an efficient way of doing concurrent inference in PyTorch.

Right now, I start 2 processes on my GPU (I have only 1 GPU, so both processes are on the same device). Each process loads my PyTorch model and performs the inference step.

My problem is that my model takes up quite a lot of memory. I have 12 GB of memory on the GPU, and the model alone takes ~3 GB (without the data), which means my 2 processes together take 6 GB just for the model.


Now I was wondering if it's possible to load the model only once and use it for inference from 2 different processes. What I want is for only 3 GB of memory to be consumed by the model, while still having 2 processes.


I came across this answer mentioning IPC, but as far as I understood it, process #2 would copy the model from process #1, so I would still end up with 6 GB allocated for the model.

I also checked the PyTorch documentation about DataParallel and DistributedDataParallel, but it seems this is not possible with them.

This seems to be what I want, but I couldn't find any code example showing how to use it with PyTorch in inference mode.


I understand it might be difficult to do such a thing for training, but please note I'm only talking about the inference step (the model is in read-only mode, no need to update gradients). With this assumption, I'm not sure whether it's possible or not.

  • I don't see why you cannot just use the same (read-only) model for your inference. You can pass different data batches into the same model; the data loading and inference can run in parallel. Multiple users can also talk to the model through a higher-level interface. Where are the bottlenecks that cause you to use two processes? Commented Feb 5, 2020 at 7:43
  • Thanks for your comment @THN. I currently start my 2 processes, load the model in each of them, and infer. Since processes cannot share memory, how would you do it? Using threads? Commented Feb 5, 2020 at 8:20
  • I would use one process to load one model and do inference. That will work for most purposes. What exactly do you want to achieve? Commented Feb 5, 2020 at 8:54
  • You can get most of the benefit of concurrency with a single model on a single process, by doing the concurrency in data loading (which is separate from the model-running process and can be done manually; tensorflow has native support for optimal parallel data preloading, which you can look into for an example) and in processing (automatically, via larger batches). Commented Feb 5, 2020 at 9:34
  • @THN I didn't know you could get most of the benefit of concurrency with a single model on a single process. I thought that, if memory allows it, it's more efficient to have 2 processes so they can run in parallel. Please post an answer! Commented Feb 5, 2020 at 23:41

2 Answers


The GPU itself has many threads. When performing an array/tensor operation, it uses each thread on one or more cells of the array. This is why it seems that an op that can fully utilize the GPU should scale efficiently without multiple processes -- a single GPU kernel is already massively parallelized.

In a comment you mentioned seeing better results with multiple processes in a small benchmark. I'd suggest running the benchmark with more jobs to ensure warmup; ten kernels seems like too small a test. If you find that a thorough, representative benchmark consistently runs faster, though, I'll trust good benchmarks over my intuition.

My understanding is that kernels launched on the default CUDA stream get executed sequentially. If you want them to run in parallel, I think you'd need multiple streams. Looking in the PyTorch code, I see code like getCurrentCUDAStream() in the kernels, which makes me think the GPU will still run any PyTorch code from all processes sequentially.

This NVIDIA discussion suggests this is correct:

https://devtalk.nvidia.com/default/topic/1028054/how-to-launch-cuda-kernel-in-different-processes/
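To illustrate, here is a minimal sketch of what launching work on separate CUDA streams could look like in PyTorch. The helper and toy model are hypothetical (not from the question); it falls back to plain sequential execution when no GPU is present, and whether the kernels actually overlap still depends on the hardware and driver:

```python
import torch

def run_on_streams(model, batches):
    """Run one forward pass per batch, each on its own CUDA stream.

    Hypothetical helper: falls back to sequential execution on CPU.
    """
    if not torch.cuda.is_available():
        with torch.no_grad():
            return [model(x) for x in batches]
    model = model.cuda()
    streams = [torch.cuda.Stream() for _ in batches]
    results = []
    with torch.no_grad():
        for stream, x in zip(streams, batches):
            with torch.cuda.stream(stream):  # queue this batch's kernels on its own stream
                results.append(model(x.cuda()))
    torch.cuda.synchronize()  # wait for every stream to finish
    return results

# CPU demo with a toy model (a real CUDA machine takes the stream path):
net = torch.nn.Linear(4, 2)
net.eval()
outs = run_on_streams(net, [torch.randn(3, 4), torch.randn(3, 4)])
print([tuple(o.shape) for o in outs])  # [(3, 2), (3, 2)]
```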

Newer GPUs may be able to run kernels from multiple processes in parallel (using MPS?), but it seems this is just implemented with time slicing under the hood anyway, so I'm not sure we should expect higher total throughput:

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

If you do need to share memory from one model across two parallel inference calls, can you just use multiple threads instead of processes, and refer to the same model from both threads?
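As a concrete illustration, here is a minimal sketch of that thread-based approach. The small model is a hypothetical stand-in for the real ~3 GB one; the point is that both threads hold a reference to a single model object, so its weights exist only once in memory:

```python
import threading
import torch

# Hypothetical stand-in for the real ~3 GB model; any nn.Module works the same way.
shared_model = torch.nn.Linear(8, 2)
shared_model.eval()  # read-only inference; no gradient bookkeeping

results = {}

def worker(name, batch):
    # Both threads reference the SAME model object, so its weights
    # are allocated only once.
    with torch.no_grad():
        results[name] = shared_model(batch)

t1 = threading.Thread(target=worker, args=("a", torch.randn(4, 8)))
t2 = threading.Thread(target=worker, args=("b", torch.randn(4, 8)))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["a"].shape, results["b"].shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Python's GIL limits pure-CPU parallelism between threads, but as far as I know PyTorch releases the GIL inside most heavy operators, so the two threads can still overlap data transfer and compute.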

To actually get the GPU to run multiple kernels in parallel, you may be able to use nn.Parallel in PyTorch. See the discussion here: https://discuss.pytorch.org/t/how-can-l-run-two-blocks-in-parallel/61618/3


1 Comment

Thanks for the very detailed answer. I definitely have to read all these resources.

You can get most of the benefit of concurrency with a single model in a single process for (read-only) inference, by making both the data loading and the model inference concurrent.

Data loading is separate from the model-running process and can be parallelized manually. As far as I know, tensorflow has some native support for optimal parallel data preloading; you can look into it for an example.
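In PyTorch, `torch.utils.data.DataLoader` with `num_workers > 0` provides this kind of parallel loading out of the box. The underlying pattern is just a producer thread filling a bounded queue while the consumer runs inference; here is a minimal sketch with a hypothetical `load_sample` standing in for real I/O:

```python
import queue
import threading
import time

def load_sample(i):
    # Hypothetical stand-in for slow disk/network I/O and preprocessing.
    time.sleep(0.01)
    return [float(i)] * 4

def prefetcher(n, buf):
    # Producer: loads samples ahead of time while the consumer infers.
    for i in range(n):
        buf.put(load_sample(i))
    buf.put(None)  # sentinel: no more data

buf = queue.Queue(maxsize=8)  # bounded, so we never load everything at once
threading.Thread(target=prefetcher, args=(5, buf), daemon=True).start()

processed = []
while (batch := buf.get()) is not None:
    # "Inference" here overlaps with the producer loading the next sample.
    processed.append(sum(batch))

print(processed)  # [0.0, 4.0, 8.0, 12.0, 16.0]
```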

Model inference is automatically parallel on GPU. You can maximize this concurrency by using larger batches.
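As a sketch of the batching idea (the model and sizes are hypothetical): instead of running two separate forward passes, pending requests are stacked into one tensor and served by a single call:

```python
import torch

batched_model = torch.nn.Linear(16, 4)  # hypothetical stand-in for the real model
batched_model.eval()

# Two "requests" arriving at about the same time.
req1 = torch.randn(1, 16)
req2 = torch.randn(1, 16)

with torch.no_grad():
    batch = torch.cat([req1, req2], dim=0)  # shape (2, 16)
    out = batched_model(batch)              # one forward pass serves both
    resp1, resp2 = out[0:1], out[1:2]       # split the results per requester

print(resp1.shape, resp2.shape)  # torch.Size([1, 4]) torch.Size([1, 4])
```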

From an architectural point of view, multiple users can also talk to the model through a higher level interface.

7 Comments

I'm wondering how the following case would be handled: inference takes 2 seconds, and 2 users request inference at almost the same time. Request #1 is inferred for 2 seconds, then request #2 is inferred for 2 seconds, so user #2 had to wait 4 seconds for his request. Isn't it better, in this case, to have 2 processes on the GPU? Then user #2's request takes just 2 seconds since a process is available.
You should look at the job scheduling problem, which is well studied in OS research and has several algorithms. In practice, jobs do not arrive at the same time, so you can process one job while loading another. If necessary, you can batch jobs together, process them in sequence if the waiting time is negligible, or divide each job if it is too large.
I did some benchmarking for my specific case: if 10 clients request a prediction, it takes 0.96 s to serve all of them with 2 processes on the same GPU. The same experiment with only a single process takes 1.42 s.
It's good that you actually tested, but note that each result is an anecdote. If all requests come at the same time, each consumes only a negligible part of the GPU, and you process each request separately, then it is certain that using 2 or more processes would be faster. But there are cases where one process is good enough, such as when requests arrive at random, and cases where one process is better, such as when the model is large and the requests can be batched together. In the end you need to look at your own typical use case, find the bottlenecks, and decide where to optimize.
Using multiple CPU processes to read requests, load data, and batch them together, then running them in one GPU process, is the same as your original question about sharing memory (which is actually the model params) on the GPU. You still need to do the work for it.
