I think this question has been asked a few times already, but I have yet to find a good answer here.
I have a PyTorch Dataset built from 2 NumPy arrays.
The following are the dimensions.
features = [10000, 450, 28] NumPy array. dim_0 = number of samples, dim_1 = time series, dim_2 = features. Basically each sample is 450 frames long, each frame contains 28 features, and I have 10000 samples.
label = [10000, 450] NumPy array. dim_0 = number of samples, dim_1 = one label per frame.
The task is to do a classification for each frame.
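(For anyone who wants to reproduce the setup, the shapes above can be mocked with placeholder data; the 3-class label range is my assumption, not something from the post:)

```python
import numpy as np

# Placeholder data matching the shapes described in the post:
# 10000 samples x 450 frames x 28 features per frame.
features = np.zeros((10000, 450, 28), dtype=np.float32)

# One integer class label per frame; 3 classes is an arbitrary assumption.
label = np.random.randint(0, 3, size=(10000, 450))

print(features.shape)  # (10000, 450, 28)
print(label.shape)     # (10000, 450)
```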
I created a PyTorch custom Dataset and DataLoader with the following code.
label_length = label.size
label = torch.from_numpy(label)
features = torch.from_numpy(features)
train_dataset = Dataset(label, features, label_length)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
As expected, train_dataloader.dataset.data returns a tensor of size [10000, 450, 28]. Great! Now I just need to draw batches from the 10000 samples and loop. So I run the code below (assume that the optimizer and loss function are all set up).
train_loss = 0
EPOCHS = 3
for epoch_idx in range(EPOCHS):
    for i, data in enumerate(train_dataloader):
        inputs, labels = data
        print(inputs.size())
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
But I get this error:
ValueError: LSTM: Expected input to be 2D or 3D, got 4D instead
When I checked the dimension of inputs, it gave [64 x 10000 x 450 x 28]
Why does the DataLoader add this extra dimension? (I understand from the documentation that it is supposed to add a batch dimension, but shouldn't it take 64 samples out of the 10000, build a batch from them, and loop over the batches?)
I think I am making a mistake somewhere but cannot pinpoint what I am doing wrong...
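For reference, the 4D shape can be reproduced in isolation: if every `__getitem__` call returns the full 3D tensor instead of one sample, the default collation stacks `batch_size` copies of it. A scaled-down sketch (the class name and the shapes 10 x 5 x 3 with batch size 4 are all made-up stand-ins for [10000, 450, 28] and 64):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class WholeTensorDataset(Dataset):
    """Mimics a __getitem__ that ignores idx and returns everything."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data          # returns all samples, not self.data[idx]

data = torch.zeros(10, 5, 3)      # scaled-down stand-in for [10000, 450, 28]
loader = DataLoader(WholeTensorDataset(data), batch_size=4)
batch = next(iter(loader))
print(batch.shape)                # torch.Size([4, 10, 5, 3]) - 4D, not 3D
```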
EDIT: This is my simple Dataset class
class Dataset(torch.utils.data.Dataset):
    def __init__(self, label, data, length):
        self.labels = label
        self.data = data
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # need to create tensor
        #data = torch.from_numpy(self.data)
        #labels = torch.from_numpy(self.labels).type(torch.LongTensor)
        data = self.data
        labels = self.labels
        return data, labels
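For comparison, here is a sketch of a `__getitem__` that indexes into the tensors per sample (my own version of what the intended behavior might look like, not code from the original class; `FrameDataset` is a made-up name):

```python
import torch

class FrameDataset(torch.utils.data.Dataset):
    def __init__(self, label, data):
        # Convert once up front; int64 labels for classification losses.
        self.labels = torch.as_tensor(label).long()
        self.data = torch.as_tensor(data).float()

    def __len__(self):
        # Number of samples, i.e. dim 0 - not the total element count.
        return self.data.shape[0]

    def __getitem__(self, idx):
        # One sample: data[idx] is [450, 28], labels[idx] is [450].
        return self.data[idx], self.labels[idx]
```

With this, each batch from the DataLoader would come out as [64, 450, 28] inputs and [64, 450] labels, which matches what an LSTM with batch_first=True expects.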