
I am trying to fine-tune BERT for a multi-label classification task (Jigsaw toxic comments). I created a custom dataset and DataLoader as follows:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class CustomDataSet(Dataset):
    
    def __init__(self,
                 features: np.ndarray,
                 labels: np.ndarray,
                 token_max: int,
                 tokenizer):
    
        self.features = features
        self.labels = labels  
        self.tokenizer = tokenizer
        self.token_max = token_max
        
    def __len__(self):
        return len(self.features)

    def __getitem__(self,
                    index: int):
        comment_id, comment_text = self.features[index]
        labels = self.labels[index]

        encoding = self.tokenizer.encode_plus(
            comment_text,
            add_special_tokens=True,
            max_length=self.token_max,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt')

        return dict(
            comment_text=comment_text,
            comment_id=comment_id,
            input_ids=encoding['input_ids'].squeeze(0),
            attention_mask=encoding['attention_mask'].squeeze(0),
            labels=torch.Tensor(labels))

and I use the following tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
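
As a quick sanity check (with hypothetical toy rows, not the real Jigsaw data), one item of the dataset looks like this:

import numpy as np

# hypothetical toy rows: (comment_id, comment_text) pairs plus 6 label columns
toy_features = np.array([['id_0', 'you are great'],
                         ['id_1', 'this is rubbish']], dtype=object)
toy_labels = np.zeros((2, 6), dtype=np.float32)

toy_dataset = CustomDataSet(toy_features, toy_labels, token_max=16, tokenizer=tokenizer)
sample = toy_dataset[0]
print(sample['input_ids'].shape)       # torch.Size([16])
print(sample['attention_mask'].shape)  # torch.Size([16])
print(sample['labels'].shape)          # torch.Size([6])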

then I create my dataset using the custom class, and the corresponding dataloader:

train_dataset = CustomDataSet(X_train, y_train, tokenizer=tokenizer, token_max=256)
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, pin_memory=True, num_workers=16, persistent_workers=False
)
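
For completeness, these are the standard worker-related knobs on torch.utils.data.DataLoader; the variant below is only illustrative (persistent_workers and prefetch_factor are regular DataLoader arguments), not something that has fixed the issue for me:

from torch.utils.data import DataLoader

# illustrative variant of the loader above: persistent_workers keeps the worker
# processes alive across epochs, prefetch_factor sets how many batches each
# worker prepares ahead of time
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=True,          # page-locked host memory for faster CPU-to-GPU copies
    num_workers=16,
    persistent_workers=True,
    prefetch_factor=2,
)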

My model is the following:

from transformers import BertModel


class MultiLabelBERT(torch.nn.Module):
    def __init__(self, num_labels):
        super(MultiLabelBERT, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)
        self.classifier = self.classifier.to(torch.float16)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output 
        logits = self.classifier(pooled_output)
        return logits

first_BERT = MultiLabelBERT(6)
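
Not shown above: before the loop below I set the device and move the model onto it, roughly like this (a minimal sketch of the device setup, which I omitted from the snippets):

import torch

# use the GPU if it is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
first_BERT = first_BERT.to(device)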

I first want to iterate through all batches of my training set and perform a forward pass:

for batch_idx, item in enumerate(train_loader):
    input_ids = item['input_ids'].to(device)
    attention_mask = item['attention_mask'].to(device)
    labels = item['labels'].to(device)
    logits = first_BERT(input_ids=input_ids,
                        attention_mask=attention_mask)

Even though I set num_workers=16 in my DataLoader, only one CPU core is used to load data onto my GPU, which significantly slows down the process. Here is what I have tried:

  • Reducing the batch size.
  • Reducing the number of tokens (token_max).
  • Tokenizing the entire dataset beforehand, to rule out the tokenizer as the bottleneck (see the sketch after the nvidia-smi output below).

When I comment out the forward pass, the DataLoader uses all CPU workers as expected. With the forward pass, however, the process appears to be bottlenecked. I have the following GPU setup:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0            588W /  115W |       9MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

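For reference, the pre-tokenization experiment from the list above looked roughly like this (a sketch of one way to tokenize everything up front; the TensorDataset wrapping is just an illustration, not necessarily the exact code I ran):

import torch
from torch.utils.data import TensorDataset

# tokenize every comment once, up front, instead of inside __getitem__
texts = [text for _, text in X_train]
encodings = tokenizer(
    texts,
    add_special_tokens=True,
    max_length=256,
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)
pretokenized_dataset = TensorDataset(
    encodings['input_ids'],
    encodings['attention_mask'],
    torch.tensor(y_train, dtype=torch.float),
)
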
Does anyone have an idea of what could be causing this?

  • Have you figured out how to utilize all CPU cores for training? Commented May 4 at 12:37
  • Unfortunately, I did not. I remember trying different solutions, but none of them worked. Commented May 5 at 13:29
  • I had the same issue, and to this day the only solution I found was using an older PyTorch version: downgrading from PyTorch 2.7 to 2.6 worked. Also make sure the CUDA versions match. Commented Jul 16 at 2:54
