
In PyTorch I wrote a very simple CNN discriminator and trained it. Now I need to deploy it to make predictions, but the target machine has little GPU memory and runs out of memory. I thought setting requires_grad = False would stop PyTorch from storing gradient values, but it doesn't seem to make any difference.

My model has about 5 million parameters, yet predicting on a single batch of input consumes about 1.2 GB of GPU memory. There should be no need for that much memory.

The question is: how can I reduce GPU memory usage when I only want to use my model to make predictions?


Here is a demo. I use discriminator.requires_grad_ to disable/enable autograd for all parameters, but it seems to have no effect.

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as functional

from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()

def getMemoryUsage():
    usage = nvsmi.DeviceQuery("memory.used")["gpu"][0]["fb_memory_usage"]
    return "%d %s" % (usage["used"], usage["unit"])

print("Before GPU Memory: %s" % getMemoryUsage())

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # trainable layers
        # input: 2x256x256
        self.conv1 = nn.Conv2d(2, 8, 5, padding=2) # 8x256x256
        self.pool1 = nn.MaxPool2d(2) # 8x128x128
        self.conv2 = nn.Conv2d(8, 32, 5, padding=2) # 32x128x128
        self.pool2 = nn.MaxPool2d(2) # 32x64x64
        self.conv3 = nn.Conv2d(32, 96, 5, padding=2) # 96x64x64
        self.pool3 = nn.MaxPool2d(4) # 96x16x16
        self.conv4 = nn.Conv2d(96, 256, 5, padding=2) # 256x16x16
        self.pool4 = nn.MaxPool2d(4) # 256x4x4
        self.num_flat_features = 4096
        self.fc1 = nn.Linear(4096, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 1)
        # loss function
        self.loss = nn.MSELoss()
        # other properties
        self.requires_grad = True
    def forward(self, x):
        y = x
        y = self.conv1(y)
        y = self.pool1(y)
        y = functional.relu(y)
        y = self.conv2(y)
        y = self.pool2(y)
        y = functional.relu(y)
        y = self.conv3(y)
        y = self.pool3(y)
        y = functional.relu(y)
        y = self.conv4(y)
        y = self.pool4(y)
        y = functional.relu(y)
        y = y.view((-1, self.num_flat_features))
        y = self.fc1(y)
        y = functional.relu(y)
        y = self.fc2(y)
        y = functional.relu(y)
        y = self.fc3(y)
        y = torch.sigmoid(y)
        return y
    def predict(self, x, score_th=0.5):
        if len(x.shape) == 3:
            singlebatch = True
            x = x.view([1]+list(x.shape))
        else:
            singlebatch = False
        y = self.forward(x)
        label = (y > float(score_th))
        if singlebatch:
            y = y.view(list(y.shape)[1:])
        return label, y
    def requires_grad_(self, requires_grad=True):
        for parameter in self.parameters():
            parameter.requires_grad_(requires_grad)
        self.requires_grad = requires_grad


x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")

# comment/uncomment this line to make difference
discriminator.requires_grad_(False)

discriminator.predict(x)

print("Requires grad", discriminator.requires_grad)
print("After GPU Memory: %s" % getMemoryUsage())

Commenting out the line discriminator.requires_grad_(False), I get this output:

Before GPU Memory: 6350MiB
Requires grad True
After GPU Memory: 7547MiB

While with the line uncommented, I get:

Before GPU Memory: 6350MiB
Requires grad False
After GPU Memory: 7543MiB
  • When I ran your code, the 'before' and 'after' GPU memory usage are both around 900MB. Why is your memory usage so high? Commented Sep 16, 2019 at 23:15

3 Answers


You can use pynvml.

This is a Python tool made by NVIDIA, with which you can query the GPU from Python like this:

from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')

You can also execute:

torch.cuda.empty_cache()

to empty PyTorch's cache; you will find even more free memory that way.

Before calling torch.cuda.empty_cache(), if you have objects you no longer use, you can drop your references to them:

obj = None

and after that run a garbage-collection pass:

import gc
gc.collect()
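
Put together, a minimal sketch of the whole cleanup pattern (the tensor x here is just a stand-in for whatever object you no longer need):

import gc
import torch

x = torch.zeros(1024, 1024, device="cuda")  # some tensor we are done with
x = None                  # drop the last Python reference to it
gc.collect()              # collect the now-unreferenced tensor
torch.cuda.empty_cache()  # return cached blocks to the GPU driver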

2 Comments

Thanks for your advice, pynvml is really awesome! However, I don't think torch.cuda.empty_cache() can solve my problem. Of course, recycling memory after the prediction finishes will decrease usage at the end, but the peak memory usage won't decrease, and that is the bottleneck in my problem.
Once you know the free memory you can adjust the batch size based on it. Currently you haven't set one; you have only specified that your images have 2 channels. Maybe this helps (see the sketch below).
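
A rough sketch of that idea, mirroring the pynvml query from the question (the per-sample cost is a made-up figure you would need to measure for your own model):

from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
free_mib = nvsmi.DeviceQuery("memory.free")["gpu"][0]["fb_memory_usage"]["free"]

COST_PER_SAMPLE_MIB = 40  # hypothetical per-sample peak usage; measure this for your model
batch_size = max(1, int(free_mib // COST_PER_SAMPLE_MIB))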

Try using model.eval() with torch.no_grad() on your target machine when making predictions. model.eval() switches layers such as batchnorm and dropout to evaluation mode. torch.no_grad() deactivates the autograd engine, so intermediate activations are not kept for a backward pass, and as a result memory usage is reduced.

x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")

discriminator.eval()
with torch.no_grad():
    discriminator.predict(x)
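
On newer PyTorch versions (1.9 or later, an assumption about your environment), torch.inference_mode() is a drop-in, slightly stricter alternative to torch.no_grad():

discriminator.eval()
with torch.inference_mode():  # like no_grad, but also skips view/version-counter tracking
    discriminator.predict(x)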

1 Comment

Thanks, but it doesn't seem to make a difference. To my knowledge, model.eval only affects specific modules such as batchnorm or dropout: it tells them to behave in evaluation mode instead of training mode. The docs don't say it stops variables from keeping gradients or other data. Besides, I am confused about how torch knows that my submodules should be in evaluation mode when I only call eval on my top-level module, whereas for requires_grad_ I had to implement the function myself and set requires_grad to False on all the submodules' parameters.
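
For reference, eval() simply calls train(False), and nn.Module.train recurses into all child modules, which is why one call on the top-level module is enough. Roughly (simplified from the PyTorch source):

def train(self, mode=True):
    self.training = mode
    for module in self.children():
        module.train(mode)
    return self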

I guess it's not relevant anymore for your specific problem, but you could take a look at TorchScript. It's a good way to decrease the size and complexity of your model, and it also speeds up prediction. Unfortunately it can't help with the training itself. It is just generally a good idea for deploying PyTorch models to other hardware or embedding them in C++ code for efficiency. Cheers. :-)
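
For example, a minimal sketch of tracing the Discriminator from the question (the filename and example input shape are placeholders):

import torch

discriminator = Discriminator().to("cuda:0").eval()
example = torch.zeros(1, 2, 256, 256, device="cuda:0")
traced = torch.jit.trace(discriminator, example)  # records forward() for this input shape
traced.save("discriminator.pt")                   # reload later with torch.jit.load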

