L1/L2 regularization in PyTorch

Question

How do I add L1/L2 regularization in PyTorch without manually computing it?

Mateen Ulhaq · Accepted Answer · 2022-07-11 08:36:19Z

108

Use weight_decay > 0 for L2 regularization:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

edited Jul 11, 2022 at 8:36

Mateen Ulhaq

answered Oct 6, 2017 at 2:47

devil in the detail

6 Comments

Ashish Over a year ago

With the SGD optimizer, L2 regularization can be obtained through weight_decay. But weight_decay and L2 regularization are different for the Adam optimizer. More can be read here: openreview.net/pdf?id=rk6qdGgCZ

Eric Wiener Over a year ago

@Ashish your comment is correct that weight_decay and L2 regularization are different, but PyTorch's implementation of Adam actually implements L2 regularization instead of true weight decay. Note that the weight decay term is applied to the gradient before the optimizer step here

Guojun Zhang Over a year ago

No.... that's not L2

Mateen Ulhaq Over a year ago

@GuojunZhang The L2 regularized loss is L = f(θ) + ½λ∑θ². But then, 𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ. If you take a look at the Adam algorithm, it effectively says g = 𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ.

Guojun Zhang Over a year ago

@MateenUlhaq I think the pytorch official implementation is not correct. As stated in the AdamW paper (arxiv.org/abs/1711.05101 Prop 2), L2 reg is not equal to weight decay for adaptive gradients

Mateen Ulhaq · Accepted Answer · 2022-07-11 08:39:34Z

91

See the documentation. Add a weight_decay parameter to the optimizer for L2 regularization.

edited Jul 11, 2022 at 8:39

Mateen Ulhaq

answered Mar 10, 2017 at 16:46

Kashyap

5 Comments

Wasi Ahmad Over a year ago

Adagrad is an optimization technique, I am talking about regularization. Can you give me a concrete example with L1 and L2 loss?

Kashyap Over a year ago

Ya, the L2 regularisation is mysteriously added in the Optimization functions because loss functions are used during Optimization. You can find the discussion here discuss.pytorch.org/t/simple-l2-regularization/139/3

dashesy Over a year ago

I have some branches using L2 loss, so this is not useful. (I have different loss functions)

mrgloom Over a year ago

What if I want to use L1 or some other loss for regularization?

Eric Wiener Over a year ago

@mrgloom you can implement that yourself. It is not included with the optimizers.

Szymon Maszke · Accepted Answer · 2025-02-22 17:30:46Z

57

The previous answers, while technically correct, are inefficient performance-wise and not very modular (hard to apply on a per-layer basis, as provided by, say, Keras layers).

PyTorch L2 implementation

Why did PyTorch implement L2 inside torch.optim.Optimizer instances?

Let's take a look at the torch.optim.SGD source code (currently a functional optimization procedure), especially this part:

for i, param in enumerate(params):
    d_p = d_p_list[i]
    # L2 weight decay specified HERE!
    if weight_decay != 0:
        d_p = d_p.add(param, alpha=weight_decay)

One can see that d_p (the derivative of the parameter, i.e. its gradient) is modified and re-assigned for faster computation (no temporary variables are saved).
It has O(N) complexity without any complicated math like pow.
It does not make autograd extend the graph needlessly.

Compare that to a penalty added to the loss: O(n) **2 operations, an addition, and participation in backpropagation.

Math

Let's look at the L2 equation with regularization factor alpha (the same could be done for L1, of course):

L = loss + ½·alpha·∑w²

If we take the derivative of any loss with L2 regularization w.r.t. the parameters w (the penalty term is independent of the loss), we get:

𝜕L/𝜕w = 𝜕loss/𝜕w + alpha·w

So it is simply an addition of alpha * weight for gradient of every weight! And this is exactly what PyTorch does above!
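A quick numerical check of this claim, as a minimal sketch that assumes the penalty is written as ½·alpha·∑w² (the tensor values and alpha are illustrative, not taken from the answer):

```python
import torch

alpha = 0.1  # regularization factor (illustrative value)
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

# L2 penalty with the conventional 1/2 factor
penalty = 0.5 * alpha * w.pow(2).sum()
penalty.backward()

# Autograd's gradient of the penalty should match alpha * w,
# which is the term the optimizer adds to d_p directly
grad_matches = torch.allclose(w.grad, alpha * w.detach())
print(grad_matches)
```

The optimizer reaches the same gradient contribution without building the penalty into the autograd graph.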

L1 Regularization layer

Using this (and some PyTorch magic), we can come up with a fairly generic L1 regularization layer, but let's look at the first derivative of L1 first (sgn is the signum function, returning 1 for positive input, -1 for negative input, and 0 for 0):

𝜕(alpha·|w|)/𝜕w = alpha·sgn(w)

The full code with a WeightDecay interface is in the torchlayers third-party library, which supports things like regularizing only weights/biases/specifically named parameters (disclaimer: I'm the author). The essence of the idea is outlined below (see comments):

import torch


class L1(torch.nn.Module):
    def __init__(self, module, weight_decay):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay

        # Backward hook is registered on the wrapped module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)

    # The hook ignores the incoming gradient values, hence the placeholder
    def _weight_decay_hook(self, *_):
        for param in self.module.parameters():
            # Guard for cases where a gradient may already be stored
            # (it is usually None or zeroed out by optimizer.zero_grad()).
            # Turn it on for gradient accumulation or as a safer default:
            # if param.grad is None or torch.all(param.grad == 0.0):

            # Set the gradient to the L1 term (overwrites any stored value)
            param.grad = self.regularize(param)

    def regularize(self, parameter):
        # Derivative of the L1 penalty: weight_decay * sgn(w)
        return self.weight_decay * torch.sign(parameter.data)

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)

Read more about hooks in this answer or respective PyTorch docs if needed.

And usage is also pretty simple (should work with gradient accumulation and PyTorch layers):

# weight_decay is a required argument; the value here is illustrative
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3), weight_decay=1e-4)

edited Feb 22 at 17:30

answered Mar 14, 2021 at 22:43

Szymon Maszke

10 Comments

Maxim Egorushkin Over a year ago

Would you like to make a new tag for torchlayers and release it with L1 and L2 because they are still missing in version 0.1.1 released more than 1 year ago?

Szymon Maszke Over a year ago

@MaximEgorushkin could you try the nightly release? It should be there although not thoroughly tested as of yet, new release is planned in the upcoming 2 months (together with other libraries)

Maxim Egorushkin Over a year ago

Nightly has L1 and L2, thank you. There is a warning though

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py:785: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.

Szymon Maszke Over a year ago

@phydev - yes, it was intended as more of a side note for neural networks specifically (not for lasso regression). I will remove it, as it might be confusing for users and is not substantiated by any source. See Kevin Yin's comment above as it is somewhat related.

phydev Over a year ago

I think your side note is important actually, just wanted to clarify. I tried different implementations of L1 regularization in PyTorch and I thought that they were wrong because I couldn't reach sparsity, so it was good to come across your answer. It's not clear to me why this happens though, but I suspect it is related to the minimisation procedure, which is quite different from Lasso. Thanks for your great answer btw, I have been using your L1 layer.

End genocide - save Gaza · Accepted Answer · 2022-08-05 12:56:14Z

30

For L2 regularization,

l2_lambda = 0.01
l2_reg = torch.tensor(0.)

for param in model.parameters():
    l2_reg += torch.norm(param)

loss += l2_lambda * l2_reg
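The comments on this answer raise two points: torch.norm returns the 2-norm rather than its square, and frozen parameters should arguably be excluded. A minimal sketch of a variant that addresses both, assuming a hypothetical helper named l2_penalty and a stand-in data loss (neither is from the answer):

```python
import torch


def l2_penalty(model: torch.nn.Module) -> torch.Tensor:
    """Sum of squared entries over all trainable parameters."""
    terms = [p.pow(2).sum() for p in model.parameters() if p.requires_grad]
    return torch.stack(terms).sum()


model = torch.nn.Linear(4, 2)
model.bias.requires_grad_(False)  # frozen parameter, skipped by the penalty

l2_lambda = 0.01
data_loss = model(torch.randn(8, 4)).pow(2).mean()  # stand-in for a real loss
loss = data_loss + l2_lambda * l2_penalty(model)  # out-of-place addition
loss.backward()
```

Building the total with out-of-place additions sidesteps the in-place += issue mentioned in the comments. The helper assumes the model has at least one trainable parameter.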

edited Aug 5, 2022 at 12:56

End genocide - save Gaza

answered Apr 30, 2018 at 12:04

Sherif Ali

4 Comments

Girishkumar Over a year ago

Shouldn't one need to exclude non-trainable parameters?

John Liu Over a year ago

torch.norm is taking 2-norm here, not the square of the 2-norm. So I think the norm should be squared to get a correct regularization.

cswu Over a year ago

without requires_grad and use += would cause error. This works for me: l2_reg = torch.tensor(0., requires_grad=True) l2_reg = l2_reg + torch.norm(param)

End genocide - save Gaza Over a year ago

Warning: torch.norm is deprecated.

Dharman · Accepted Answer · 2025-05-15 11:55:59Z

28

L2 regularization out-of-the-box

Yes, PyTorch optimizers have a parameter called weight_decay which corresponds to the L2 regularization factor:

sgd = torch.optim.SGD(model.parameters(), weight_decay=weight_decay)

L1 regularization implementation

There is no analogous argument for L1; however, this is straightforward to implement manually:

loss = loss_fn(outputs, labels)
l1_lambda = 0.001
# Elementwise L1 norm; torch.linalg.norm(p, 1) would compute the matrix 1-norm for 2-D weights
l1_norm = sum(p.abs().sum() for p in model.parameters())

loss = loss + l1_lambda * l1_norm

The equivalent manual implementation of L2 would be:

l2_reg = sum(p.pow(2).sum() for p in model.parameters())

Source: Deep Learning with PyTorch (8.5.2)

edited May 15 at 11:55

Dharman♦

answered Mar 9, 2021 at 8:48

End genocide - save Gaza

David · Accepted Answer · 2022-11-08 13:32:36Z

18

For L1 regularization that includes weights only:

l1_reg = torch.tensor(0., requires_grad=True)

for name, param in model.named_parameters():
    if 'weight' in name:
        # Elementwise L1 norm: sum of absolute values
        l1_reg = l1_reg + param.abs().sum()

# Note: 10e-4 is the same as 1e-3
total_loss = total_loss + 10e-4 * l1_reg

edited Nov 8, 2022 at 13:32

David

answered Oct 24, 2019 at 2:41

oukohou

1 Comment

End genocide - save Gaza Over a year ago

Warning: torch.norm is deprecated.

prosti · Accepted Answer · 2019-09-30 14:06:28Z

6

Interestingly, torch.norm is slower on CPU and faster on GPU compared with the direct approach.

import torch
x = torch.randn(1024,100)
y = torch.randn(1024,100)

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)

Out:

1000 loops, best of 3: 910 µs per loop
1000 loops, best of 3: 1.76 ms per loop

On the other hand:

import torch
x = torch.randn(1024,100).cuda()
y = torch.randn(1024,100).cuda()

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)

Out:

10000 loops, best of 3: 50 µs per loop
10000 loops, best of 3: 26 µs per loop

edited Sep 30, 2019 at 14:06

answered May 2, 2019 at 11:36

prosti

3 Comments

Muppet Over a year ago

Confirmed that on my end as well. torch.norm is about 60% slower in this example.

seermer Over a year ago

This answer is incorrect. GPU calculations are non-blocking, which means that timeit will not work correctly: calculations are still in progress on the GPU even after the CPU (where timeit runs) takes control. To get the correct timing, you must synchronize before stopping the timer.

seermer Over a year ago

If you time it correctly, you will see torch.norm is almost twice as fast as sqrt approach (by using torch.cuda.synchronize before stop timer)
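The synchronization point made in these comments can be sketched as follows. The helper name time_op and the repeat count are assumptions for illustration, and torch.linalg.vector_norm stands in for the deprecated torch.norm; the code falls back to CPU when no GPU is available:

```python
import time
import torch


def time_op(fn, device, repeats=100):
    """Average wall-clock seconds per call, synchronizing on CUDA."""
    fn()  # warm-up call
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == "cuda":
        # Wait for queued kernels to finish before stopping the timer
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 100, device=device)
y = torch.randn(1024, 100, device=device)

t_direct = time_op(lambda: torch.sqrt((x - y).pow(2).sum(1)), device)
t_norm = time_op(lambda: torch.linalg.vector_norm(x - y, ord=2, dim=1), device)
print(f"direct: {t_direct:.2e} s, vector_norm: {t_norm:.2e} s")
```

Absolute numbers depend on hardware and PyTorch version, so no particular ratio should be expected from this sketch.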

Albert · Accepted Answer · 2023-01-23 11:05:25Z

To extend on the good answers: As it was said, L2 norm added to the loss is equivalent to weight decay iff you use plain SGD without momentum. Otherwise, e.g. with Adam, it is not exactly the same. The AdamW paper [1] pointed out that weight decay is actually more stable. That is why you should use weight decay, which is an option to the optimizer. And consider using AdamW instead of Adam.

Also note, you probably don't want weight decay on all parameters (model.parameters()), but only on a subset. See here for examples:

[1] Decoupled Weight Decay Regularization (AdamW), 2017
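One way to apply weight decay to only a subset of parameters is through optimizer parameter groups. The sketch below is an assumption-laden illustration: the toy model, the decay value, and the rule "decay tensors with two or more dimensions, skip biases and normalization scales" are a common heuristic, not something prescribed by this answer.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.LayerNorm(32),
    torch.nn.Linear(32, 4),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # Heuristic (assumption): decay weight matrices, skip 1-D parameters
    (decay if param.ndim >= 2 else no_decay).append(param)

# Each group can carry its own weight_decay setting
optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

Per-group options override the optimizer-level defaults, so the same pattern works with torch.optim.SGD or torch.optim.Adam.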

Mateen Ulhaq · Accepted Answer · 2024-06-07 01:15:23Z

Proof that `weight_decay` for `torch.optim.Adam` is the L2 regularization coefficient

The L2 regularized loss is:

L = f(θ) + ½λ∑θ²

Then, the derivative (gradient) vector is:

𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ

PyTorch's Adam implementation computes the gradient as:

g = 𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ

...where λ = weight_decay.

Adam algorithm (as used by PyTorch)

Terminology

The later-published AdamW paper uses the term "decoupled weight decay" to refer to a different concept. (Green in the image below.) This concept is different from PyTorch's weight_decay. (Pink in the image below.)

AdamW algorithm (from paper)

Note that the original Adam paper does not explicitly mention L2 regularization as is included by PyTorch. Presumably, this is because L2 is easy enough to implement outside the optimizer:

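# Flatten the parameter groups into a single list of tensors
parameters = [p for g in optimizer.param_groups for p in g["params"]]
l2 = sum(p.square().sum() for p in parameters)
# The 0.5 factor matches L = f(θ) + ½λ∑θ² above
loss = mse(...) + 0.5 * weight_decay * l2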
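The claim can also be checked numerically. The following is a minimal sketch under assumed values (a tiny linear model, random data, λ = 0.1, a single step): it compares Adam with weight_decay, Adam with an explicit ½λ∑θ² penalty in the loss, and AdamW with the same weight_decay.

```python
import copy
import torch

torch.manual_seed(0)
lam, lr = 0.1, 1e-2
base = torch.nn.Linear(3, 1)
x, y = torch.randn(16, 3), torch.randn(16, 1)


def one_step(model, optimizer, penalty=0.0):
    """One optimizer step on MSE loss plus an optional explicit L2 penalty."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    if penalty:
        loss = loss + 0.5 * penalty * sum(p.square().sum() for p in model.parameters())
    loss.backward()
    optimizer.step()
    return torch.cat([p.detach().flatten() for p in model.parameters()])


# Three identical copies of the same initial model
m1, m2, m3 = (copy.deepcopy(base) for _ in range(3))
adam_wd = one_step(m1, torch.optim.Adam(m1.parameters(), lr=lr, weight_decay=lam))
adam_l2 = one_step(m2, torch.optim.Adam(m2.parameters(), lr=lr), penalty=lam)
adamw = one_step(m3, torch.optim.AdamW(m3.parameters(), lr=lr, weight_decay=lam))

same_as_l2 = torch.allclose(adam_wd, adam_l2, atol=1e-6)
same_as_adamw = torch.allclose(adam_wd, adamw, atol=1e-6)
print(same_as_l2, same_as_adamw)
```

Under these assumptions the first comparison should agree up to floating-point error, while the AdamW result should differ because its decay is applied to the parameters directly rather than through the gradient.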
