I have created a DNN model with PyTorch (input_dim=6, output_dim=150). Normally, if I feed it a random X_in = torch.randn(6000, 6), it returns model_out with shape (6000, 150), and I would expect the rank of model_out to be 150 (since the model's weights and biases are also randomly initialised).
However, the following code shows this is NOT TRUE:
import torch
import torch.nn as nn
torch.manual_seed(923) # for reproducible result
class MyDNN(nn.Module):
    def __init__(self):
        super(MyDNN, self).__init__()
        # layer 0:
        self.linear_0 = nn.Linear(6, 150)
        self.activ_0 = nn.Tanh()
        # layer 1:
        self.linear_1 = nn.Linear(150, 150)
        self.activ_1 = nn.Tanh()
        # layer 2:
        self.linear_2 = nn.Linear(150, 150)
        self.activ_2 = nn.Tanh()
        # layer 3:
        self.linear_3 = nn.Linear(150, 150)
        self.activ_3 = nn.Tanh()

    def forward(self, x):
        out = self.activ_0(self.linear_0(x))    # output: layer 0
        out = self.activ_1(self.linear_1(out))  # output: layer 1
        out = self.activ_2(self.linear_2(out))  # output: layer 2
        out = self.activ_3(self.linear_3(out))  # output: layer 3
        return out
model = MyDNN()
X_in = torch.randn(6000, 6, dtype=torch.float32)
with torch.no_grad():
    model_out = model(X_in)
print(f'model_out rank = {torch.linalg.matrix_rank(model_out)}')
This prints model_out rank = 115. Apparently this is a WRONG output; there is no way the output has that many linearly dependent columns when all the inputs, weights, and biases are randomly initialised!
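If I understand the torch.linalg.matrix_rank docs correctly, the rank is determined from an SVD by counting singular values above a tolerance that scales with the dtype's machine epsilon (my reading of the default is rtol = max(m, n) * eps). To see whether the columns are exactly dependent or just falling below that cutoff, I also inspected the singular value spectrum with this sketch:

S = torch.linalg.svdvals(model_out)         # singular values, sorted descending
eps = torch.finfo(model_out.dtype).eps      # ~1.19e-07 for float32
tol = S.max() * max(model_out.shape) * eps  # default cutoff, per my reading of the docs
print(f'smallest singular value = {S.min().item():.3e}')
print(f'default tolerance       = {tol.item():.3e}')
print(f'count above tolerance   = {(S > tol).sum().item()}')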
The problem goes away if I change both the X_in dtype and the model dtype to float64:
model_64 = MyDNN()
model_64.double()
X_in_64 = torch.randn(6000, 6, dtype=torch.float64)
with torch.no_grad():
    model_64_out = model_64(X_in_64)
print(f'model_64_out rank = {torch.linalg.matrix_rank(model_64_out)}')
This prints model_64_out rank = 150, as expected.
Here are my questions:
- Why does this happen? Is this really just a precision problem? I mean, float32 already has decent precision. Actually, when I use my own training data, even with mini_batch_size = 10 (so output.shape = (10, 150)), Rank(output) is less than 10.
- Although the problem can be solved by using double precision, this slows down the whole training process a lot (and the Mac M1 Pro GPU only supports float32). Is there any other solution? A sketch of the kind of workaround I have in mind follows below.
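For reference, this is the kind of cheaper workaround I am wondering about: keep the model and training in float32 and only change how the rank itself is computed, either by casting the output to float64 for the check or by passing an explicit rtol. Both variants below are untested sketches, and I am not sure either gives a numerically meaningful rank:

# Hypothetical: float32 forward pass, rank computed in float64 / with explicit rtol.
with torch.no_grad():
    out32 = model(X_in)                                  # plain float32 output
rank_cast = torch.linalg.matrix_rank(out32.double())     # cast only for the rank check
rank_rtol = torch.linalg.matrix_rank(out32, rtol=1e-12)  # cutoff far below the float32 default
print(f'rank via float64 cast  = {rank_cast}')
print(f'rank via explicit rtol = {rank_rtol}')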