I'm working on a machine learning project in PyTorch where I need to optimize a model using the full batch gradient descent method. The key requirement is that the optimizer should use all the data points in the dataset for each update. My challenge with the existing torch.optim.SGD optimizer is that it doesn't inherently support using the entire dataset in a single update. This is crucial for my project as I need the optimization process to consider all data points to ensure the most accurate updates to the model parameters.
Additionally, I would like to retain the use of Nesterov momentum in the optimization process. I understand that one could set the batch size equal to the size of the entire dataset, simulating a full-batch update with the SGD optimizer. However, I'm interested in whether there's a more elegant or direct way to implement a true gradient descent optimizer in PyTorch that also supports Nesterov momentum.
Ideally, I'm looking for a solution or guidance on how to implement or configure an optimizer in PyTorch that meets the following criteria:
- Utilizes the entire dataset for each parameter update (true Gradient Descent behavior).
- Incorporates Nesterov momentum for more efficient convergence.
- Is compatible with the rest of the PyTorch ecosystem, ideally by subclassing `torch.optim.Optimizer` (a rough sketch of what I have in mind follows this list).
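For concreteness, here is the kind of skeleton I have in mind. The class name `FullBatchGD` is my own placeholder, and the `step()` logic simply mirrors my understanding of the Nesterov update in `torch.optim.SGD`; the "full batch" part would still have to come from computing the loss over the whole dataset before calling `step()`:

```python
import torch
from torch.optim import Optimizer


class FullBatchGD(Optimizer):
    """Placeholder gradient-descent optimizer with Nesterov momentum.

    The update rule is the same one torch.optim.SGD applies; the
    "full batch" behaviour comes from accumulating gradients over the
    whole dataset before step() is called, not from the optimizer.
    """

    def __init__(self, params, lr=1e-2, momentum=0.9):
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr, momentum = group["lr"], group["momentum"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                d_p = p.grad
                state = self.state[p]
                # Velocity buffer: v <- momentum * v + grad
                if "momentum_buffer" not in state:
                    buf = state["momentum_buffer"] = torch.clone(d_p).detach()
                else:
                    buf = state["momentum_buffer"]
                    buf.mul_(momentum).add_(d_p)
                # Nesterov look-ahead: step along grad + momentum * v
                d_p = d_p.add(buf, alpha=momentum)
                p.add_(d_p, alpha=-lr)
        return loss
```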
Set the `batch_size` parameter in your `DataLoader` equal to `len(dataset)`. Here's a related question.
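As a minimal sketch of that approach (the toy dataset, model, and hyperparameters below are only placeholders), the full-batch behaviour comes entirely from the `DataLoader`, while Nesterov momentum is handled by `torch.optim.SGD` itself:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data purely for illustration; substitute your own Dataset.
X = torch.randn(500, 10)
y = torch.randn(500, 1)
dataset = TensorDataset(X, y)

# One batch == the whole dataset, so every optimizer step is a full-batch step.
loader = DataLoader(dataset, batch_size=len(dataset), shuffle=False)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

# "SGD" is stochastic only in name: fed a single full batch per epoch,
# it performs true gradient descent. nesterov=True requires momentum > 0.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

for epoch in range(100):
    for xb, yb in loader:  # this inner loop runs exactly once per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```

If the whole dataset does not fit in memory in a single forward pass, you can keep a smaller `batch_size`, accumulate gradients with repeated `loss.backward()` calls (dividing each batch loss by the number of batches so the accumulated gradient matches the full-batch mean), and call `optimizer.step()` only once per epoch; that is still full-batch gradient descent mathematically.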