
In PyTorch I wrote a very simple CNN discriminator and trained it. Now I need to deploy it to make predictions, but the target machine has little GPU memory and runs out of memory. I thought setting requires_grad = False would stop PyTorch from storing gradient values, but it doesn't seem to make any difference.

My model has about 5 million parameters, yet predicting on a single batch of input consumes about 1.2 GB of GPU memory. There should be no need for that much memory.

The question is: how can I reduce GPU memory usage when I only want to use my model to make predictions?


Here is a demo. I use discriminator.requires_grad_ to disable/enable autograd for all parameters, but it seems to have no effect.

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as functional

from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()

def getMemoryUsage():
    usage = nvsmi.DeviceQuery("memory.used")["gpu"][0]["fb_memory_usage"]
    return "%d %s" % (usage["used"], usage["unit"])

print("Before GPU Memory: %s" % getMemoryUsage())

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # trainable layers
        # input: 2x256x256
        self.conv1 = nn.Conv2d(2, 8, 5, padding=2) # 8x256x256
        self.pool1 = nn.MaxPool2d(2) # 8x128x128
        self.conv2 = nn.Conv2d(8, 32, 5, padding=2) # 32x128x128
        self.pool2 = nn.MaxPool2d(2) # 32x64x64
        self.conv3 = nn.Conv2d(32, 96, 5, padding=2) # 96x64x64
        self.pool3 = nn.MaxPool2d(4) # 96x16x16
        self.conv4 = nn.Conv2d(96, 256, 5, padding=2) # 256x16x16
        self.pool4 = nn.MaxPool2d(4) # 256x4x4
        self.num_flat_features = 4096
        self.fc1 = nn.Linear(4096, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 1)
        # loss function
        self.loss = nn.MSELoss()
        # other properties
        self.requires_grad = True
    def forward(self, x):
        y = x
        y = self.conv1(y)
        y = self.pool1(y)
        y = functional.relu(y)
        y = self.conv2(y)
        y = self.pool2(y)
        y = functional.relu(y)
        y = self.conv3(y)
        y = self.pool3(y)
        y = functional.relu(y)
        y = self.conv4(y)
        y = self.pool4(y)
        y = functional.relu(y)
        y = y.view((-1, self.num_flat_features))
        y = self.fc1(y)
        y = functional.relu(y)
        y = self.fc2(y)
        y = functional.relu(y)
        y = self.fc3(y)
        y = torch.sigmoid(y)
        return y
    def predict(self, x, score_th=0.5):
        if len(x.shape) == 3:
            singlebatch = True
            x = x.view([1]+list(x.shape))
        else:
            singlebatch = False
        y = self.forward(x)
        label = (y > float(score_th))
        if singlebatch:
            y = y.view(list(y.shape)[1:])
        return label, y
    def requires_grad_(self, requires_grad=True):
        for parameter in self.parameters():
            parameter.requires_grad_(requires_grad)
        self.requires_grad = requires_grad


x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")

# comment/uncomment this line to make difference
discriminator.requires_grad_(False)

discriminator.predict(x)

print("Requires grad", discriminator.requires_grad)
print("After GPU Memory: %s" % getMemoryUsage())

Commenting out the line discriminator.requires_grad_(False), I get this output:

Before GPU Memory: 6350MiB
Requires grad True
After GPU Memory: 7547MiB

While with the line uncommented, I get:

Before GPU Memory: 6350MiB
Requires grad False
After GPU Memory: 7543MiB
  • When I ran your code, the 'before' and 'after' GPU memory usage are both around 900MB. Why is your memory usage so high? Commented Sep 16, 2019 at 23:15

3 Answers


You can use pynvml.

This is a Python tool made by NVIDIA, with which you can query the GPU from Python like this:

from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')

You can also execute:

torch.cuda.empty_cache()

to empty PyTorch's cache; you will find even more free memory that way.

Before calling torch.cuda.empty_cache(), if you have objects you no longer use, you can drop your references to them:

obj = None

and after that run a garbage-collection pass:

import gc
gc.collect()
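
Put together, a minimal sketch of the whole cleanup pattern (the tensor x here is just a stand-in for whatever object you no longer need):

import gc
import torch

x = torch.zeros(1024, 1024, device="cuda")  # some tensor we are done with
x = None                  # drop the last Python reference to it
gc.collect()              # collect the now-unreferenced tensor
torch.cuda.empty_cache()  # return cached blocks to the GPU driver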

2 Comments

Thanks for your advice, pynvml is really awesome! However, I don't think torch.cuda.empty_cache() can solve my problem. Of course, recycling memory after the prediction finishes will decrease usage at the end, but the peak memory usage won't decrease, and that is the bottleneck in my problem.
Once you know the free memory you can adjust the batch size based on it. Currently you haven't set one; you have only specified that your images have 2 channels. Maybe this helps (see the sketch below).
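
A rough sketch of that idea, mirroring the pynvml query from the question (the per-sample cost is a made-up figure you would need to measure for your own model):

from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
free_mib = nvsmi.DeviceQuery("memory.free")["gpu"][0]["fb_memory_usage"]["free"]

COST_PER_SAMPLE_MIB = 40  # hypothetical per-sample peak usage; measure this for your model
batch_size = max(1, int(free_mib // COST_PER_SAMPLE_MIB))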

Try using model.eval() with torch.no_grad() on your target machine when making predictions. model.eval() switches layers such as batchnorm and dropout to evaluation mode. torch.no_grad() deactivates the autograd engine, so intermediate activations are not kept for a backward pass, and as a result memory usage is reduced.

x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")

discriminator.eval()
with torch.no_grad():
    discriminator.predict(x)
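
On newer PyTorch versions (1.9 or later, an assumption about your environment), torch.inference_mode() is a drop-in, slightly stricter alternative to torch.no_grad():

discriminator.eval()
with torch.inference_mode():  # like no_grad, but also skips view/version-counter tracking
    discriminator.predict(x)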

1 Comment

Thanks, but it doesn't seem to make a difference. To my knowledge, model.eval only affects specific modules such as batchnorm or dropout: it tells them to behave in evaluation mode instead of training mode. The docs don't say it stops variables from keeping gradients or other data. Besides, I am confused about how torch knows that my submodules should be in evaluation mode when I only call eval on my top-level module, whereas for requires_grad_ I had to implement the function myself and set requires_grad to False on all the submodules' parameters.
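
For reference, eval() simply calls train(False), and nn.Module.train recurses into all child modules, which is why one call on the top-level module is enough. Roughly (simplified from the PyTorch source):

def train(self, mode=True):
    self.training = mode
    for module in self.children():
        module.train(mode)
    return self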

I guess it's not relevant anymore for your specific problem, but you could take a look at TorchScript. It's a good way to decrease the size and complexity of your model, and it also speeds up prediction. Unfortunately it can't help with the training itself. It is just generally a good idea for deploying PyTorch models to other hardware or embedding them in C++ code for efficiency. Cheers. :-)
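
For example, a minimal sketch of tracing the Discriminator from the question (the filename and example input shape are placeholders):

import torch

discriminator = Discriminator().to("cuda:0").eval()
example = torch.zeros(1, 2, 256, 256, device="cuda:0")
traced = torch.jit.trace(discriminator, example)  # records forward() for this input shape
traced.save("discriminator.pt")                   # reload later with torch.jit.load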

