In your code, I altered the printouts a bit to better visualize (at least in my opinion) what's going on:
```python
import torch.nn as nn


class A(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.fc(x))


a = A().to('cuda')
print("\nfrom `A().to('cuda')`")
print(f"{id(a.fc.weight)=}, {a.fc.weight.data_ptr()=}")
print(f"{id(a.fc.bias)=}, {a.fc.bias.data_ptr()=}")
# from `A().to('cuda')`
# id(a.fc.weight)=138293720716624, a.fc.weight.data_ptr()=138293368850432
# id(a.fc.bias)=138293720716720, a.fc.bias.data_ptr()=138293368850944

weight = {}
for key, value in a.state_dict().items():
    weight[key] = value

print("\nfrom `weight`")
for key, value in weight.items():
    print(f"{key}, {id(value)=}, {value.data_ptr()=}")
# from `weight`
# fc.weight, id(value)=138293720716816, value.data_ptr()=138293368850432
# fc.bias, id(value)=138293720716528, value.data_ptr()=138293368850944

a.to('cpu')
print("\nfrom `a.to('cpu')`")
print(f"{id(a.fc.weight)=}, {a.fc.weight.data_ptr()=}")
print(f"{id(a.fc.bias)=}, {a.fc.bias.data_ptr()=}")
# from `a.to('cpu')`
# id(a.fc.weight)=138293720716624, a.fc.weight.data_ptr()=101884008832832
# id(a.fc.bias)=138293720716720, a.fc.bias.data_ptr()=101884008983616
```
If you compare the first two blocks of printouts (the one from `A().to('cuda')` and the one from the `weight` dict), you get:
```python
# from `A().to('cuda')`
# id(a.fc.weight)=138293720716624, a.fc.weight.data_ptr()=138293368850432
# id(a.fc.bias)=138293720716720, a.fc.bias.data_ptr()=138293368850944

# from `weight`
# fc.weight, id(value)=138293720716816, value.data_ptr()=138293368850432
# fc.bias, id(value)=138293720716528, value.data_ptr()=138293368850944
```
The IDs are different, but the data pointers are the same. What this means: the `weight` dict contains shallow copies of the tensors in `a` ("copies" because they have a new ID, "shallow" because they point to the same memory, namely the one on the GPU). This is in line with `state_dict()`, which you are using to produce the `weight` dict, and which is documented to produce shallow copies.
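A quick way to see what "shallow" means in practice: an in-place change to a parameter is visible through the corresponding dict entry, because both point to the same memory. A small sketch, using a fresh model (here called `m`, so that `a` from above stays untouched):

```python
import torch

m = A()                      # fresh model, parameters on the CPU
snapshot = m.state_dict()    # shallow copies: new tensor objects, same underlying storage

with torch.no_grad():
    m.fc.bias.zero_()        # in-place change to the parameter's memory

print(snapshot['fc.bias'])   # all zeros as well, since the dict entry shares that memory
```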
If you compare the first and last block of printouts (the one from `A().to('cuda')` and the one from `a.to('cpu')`), you have the opposite situation:
```python
# from `A().to('cuda')`
# id(a.fc.weight)=138293720716624, a.fc.weight.data_ptr()=138293368850432
# id(a.fc.bias)=138293720716720, a.fc.bias.data_ptr()=138293368850944

# from `a.to('cpu')`
# id(a.fc.weight)=138293720716624, a.fc.weight.data_ptr()=101884008832832
# id(a.fc.bias)=138293720716720, a.fc.bias.data_ptr()=101884008983616
```
The IDs are the same, but the data pointers are different. What this means: the parameters of your model `a` (`a.fc.weight` and `a.fc.bias`) still refer to the same tensor objects (same IDs), but in the meantime, the tensors' underlying memory has been replaced (different data pointers, now pointing to CPU memory).
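You can make this in-place behavior explicit by keeping a reference to the parameter object across the move. A small sketch on a fresh model (here called `b`), assuming a CUDA device is available and PyTorch's default conversion behavior:

```python
b = A()                                # fresh model, parameters on the CPU
w_ref = b.fc.weight                    # keep a handle on the Parameter object
ptr_before = w_ref.data_ptr()

b.to('cuda')                           # nn.Module.to() moves parameters in place

print(w_ref is b.fc.weight)            # True: still the very same Python object
print(w_ref.data_ptr() == ptr_before)  # False: its underlying memory was replaced
print(w_ref.device)                    # cuda:0
```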
Your code ends with
print("a.state_dict() device:", [t.device for t in a.state_dict().values()]) # in CPU
print("weight device:", [t.device for t in weight.values()]) # still in GPU
So you are comparing
- the tensors of your model `a`'s parameters (or rather, a new shallow copy of them, since you call `a.state_dict()` once more), whose memory, by now, has been moved to the CPU, with
- the shallow copies from earlier on (the items in the `weight` dict, which result from your first call of `a.state_dict()`), whose memory still resides on the GPU.
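If what you actually want is a snapshot of the weights that is unaffected by any later `.to(...)` call on the model, copy the data rather than just the tensor objects. Two common options, as a sketch (`weight_snapshot` and `weight_cpu` are just names I made up here):

```python
import copy

# independent copy on the tensors' current device
weight_snapshot = copy.deepcopy(a.state_dict())

# independent copy forced onto the CPU, regardless of where the model lives
weight_cpu = {k: v.to('cpu', copy=True) for k, v in a.state_dict().items()}
```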