I am working on a binary classification task using an audio dataset, which is already divided into training and testing sets. However, I also need a validation set, so I split the training set into training and validation subsets.
I have created a custom PyTorch Dataset class (CustomAudioDataset) that takes a transform argument: a list of two Compose objects. The first applies augmentations to the raw audio, and the second applies augmentations to the Mel spectrogram:
audio_transforms = T.Compose([
    T.RandomApply([AddGaussianNoise()], p=0.5),
    T.RandomApply([TimeShift()], p=0.5),
    T.RandomApply([PitchShift(SAMPLE_RATE)], p=0.5)
])

mel_transforms = T.Compose([
    T.RandomApply([FrequencyMasking()], p=0.5),
    T.RandomApply([TimeMasking()], p=0.5),
    T.RandomApply([TimeStretch()], p=0.5),
    T.Normalize(mean=[0], std=[1])
])
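Each of the custom transforms (AddGaussianNoise, TimeShift, etc.) is a simple callable. For illustration, here is a simplified sketch of what AddGaussianNoise looks like (the default std value here is just illustrative, not my actual setting):

```python
import torch

class AddGaussianNoise:
    """Simplified, illustrative version of my custom raw-audio transform."""

    def __init__(self, std=0.005):
        self.std = std

    def __call__(self, waveform):
        # Add zero-mean Gaussian noise, scaled by `std`, to the waveform tensor.
        return waveform + torch.randn_like(waveform) * self.std
```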
Here’s a simplified version of my CustomAudioDataset class:
class CustomAudioDataset(Dataset):
    def __init__(self, parent_directory, transform=None):
        self.parent_directory = parent_directory
        self.transform = transform
        self.audio_files = []
        self.labels = []
        for label in ['0', '1']:
            directory = os.path.join(parent_directory, label)
            for file_name in os.listdir(directory):
                if file_name.endswith('.wav'):
                    self.audio_files.append(os.path.join(directory, file_name))
                    self.labels.append(int(label))

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio_path = self.audio_files[idx]
        label = self.labels[idx]
        audio, sr = torchaudio.load(audio_path)
        if self.transform and len(self.transform) > 0:
            audio = self.transform[0](audio)  # Apply raw audio augmentations
        audio = self.pad_audio(audio)
        features = self.extract_features(audio)
        if self.transform and len(self.transform) > 1:
            features = self.transform[1](features)  # Apply Mel spectrogram augmentations
        return features, label
To create the train-validation split, I do the following:
train_val_dataset = CustomAudioDataset(
    parent_directory="...",
    transform=[audio_transforms, mel_transforms]
)

train_size = int(0.75 * len(train_val_dataset))
val_size = len(train_val_dataset) - train_size
train_dataset, val_dataset = random_split(train_val_dataset, [train_size, val_size])

val_dataset.dataset.transform = []  # Disable transformations for validation
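To make my concern about shared state concrete, here is a minimal toy example (ToyDataset is just for illustration) showing that random_split returns Subset views, not copies, so both subsets reference the same underlying dataset object:

```python
import torch
from torch.utils.data import Dataset, random_split

class ToyDataset(Dataset):
    """Stand-in for CustomAudioDataset, just to demonstrate shared state."""

    def __init__(self):
        self.transform = ["some transform"]

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx

full = ToyDataset()
train, val = random_split(full, [7, 3])

print(train.dataset is val.dataset)  # True: both Subsets wrap the same object
val.dataset.transform = []
print(train.dataset.transform)       # []: the change is visible through train too
```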
My Question:
When I set val_dataset.dataset.transform = [] to disable augmentations for the validation dataset, does this actually work? And if it does, does it affect only the validation dataset, or does it also impact the training dataset (train_dataset)?