My problem is not related to the code itself; it is related to the "GPU memory" listed in the Windows Task Manager.
Briefly about the problem: I have an RTX 4090 video card with 24GB of video memory. My code uses libtorch C++ v2.0.0 and is compiled with MSVC 2019 on Windows 10 x64 (CUDA 11.8). With 16GB of RAM, my program worked fine: a simple training example took about 5 seconds per epoch, and the used GPU memory was about 6.3GB. With 16GB of RAM, Windows Task Manager also showed the available "GPU memory" as 24 + 7.9 = 31.9GB, where 7.9GB is the "shared GPU memory" (apparently 50% of RAM). The situation is the same with 24GB of RAM.
With 32GB of RAM, the total "GPU memory" is now 24 + 15.9 = 39.9GB, where 15.9GB is the "shared GPU memory" (again 50% of RAM). Now, when the backward() method is executed, the used GPU memory increases from 4.4GB to 37.4GB, and after that each epoch takes a very long time to compute (about 7 minutes instead of 5 seconds).
The code I run is always the same! The only difference is the amount of "virtual" video memory, i.e. the mixture of GPU memory and 50% of system RAM that Windows Task Manager calls "GPU memory" (physical video memory is called "dedicated GPU memory"). For my code, when this pool (24 + 15.9 = 39.9GB with 32GB of RAM) is larger than 37.4GB, libtorch automatically allocates about 38GB of it to speed up the calculation of gradients (when the forward method is called); but since these 37.4GB are not purely physical video memory, performance actually drops. Conversely, when this "virtual" video memory is smaller than 37.4GB (24 + 7.9 = 31.9GB with 16GB of RAM), libtorch automatically uses only about 6.3GB of physical video memory and performance is relatively good (there is some increase in "GPU memory" only during the first epoch).
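To check how much of the allocation actually lands in dedicated video memory, I query the device from inside the process. This is just a minimal diagnostic sketch using the CUDA runtime API; as far as I understand, the numbers it reports refer to dedicated memory only, not to the Task Manager's shared pool:

```cpp
// Minimal diagnostic sketch (CUDA runtime API): report dedicated device memory.
// As far as I understand, these numbers exclude the "shared GPU memory" pool.
#include <cuda_runtime.h>
#include <cstdio>

void print_dedicated_memory() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return;
    }
    std::printf("dedicated GPU memory: %.1f GB total, %.1f GB free\n",
                total_bytes / 1e9, free_bytes / 1e9);
}
```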
In the pictures below, you can see the intensive GPU memory consumption while the backward() method executes for the same executable file; it varies only with the amount of RAM:
So my questions are: how can I limit the available GPU memory in CUDA or libtorch? For example, is there a function in CUDA/libtorch that tells libtorch it may only use 24GB of GPU memory? Or how can I reduce (control) the "shared GPU memory" under Windows (I didn't find an appropriate option in my BIOS)?
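To make the question more concrete, here is the kind of call I was hoping to find. I am not sure this is the right API or that it actually prevents spilling into "shared GPU memory"; the setMemoryFraction call below is my assumption about how libtorch's caching allocator can be capped (as a C++ counterpart of torch.cuda.set_per_process_memory_fraction in Python):

```cpp
#include <torch/torch.h>
#include <c10/cuda/CUDACachingAllocator.h>

int main() {
    // Touch the CUDA device once so the caching allocator is initialized.
    auto warmup = torch::zeros({1}, torch::kCUDA);

    // Assumption: this caps the caching allocator on device 0 at 100% of the
    // *dedicated* 24GB, so that allocations fail instead of overflowing into
    // "shared GPU memory". Untested on my setup.
    c10::cuda::CUDACachingAllocator::setMemoryFraction(1.0, 0);
    return 0;
}
```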


