I’m experimenting with the MathWorks example that inserts a multi-head self-attention layer into a simple CNN for the DigitDataset:
NUM_HEADS = 4;          % set to 4 or 8 (see the two cases below)
NUM_KEY_CHANNELS = 784; % set to 784 or 392 (see the two cases below)

layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3,32,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,64,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    flattenLayer
    selfAttentionLayer(NUM_HEADS, NUM_KEY_CHANNELS) % <-- point of interest
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
When I change only the self-attention hyperparameters, MATLAB reports a different number of learnable parameters for each of the two configurations below:
Case 1: NumHeads = 4, NumKeyChannels = 784
Case 2: NumHeads = 8, NumKeyChannels = 392
In both cases the product NumHeads × NumKeyChannels = 3136, so I expected the two networks to have the same number of learnable parameters. However, MATLAB reports different totals.
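For concreteness, the counts can be reproduced programmatically like this (a minimal sketch, assuming a release where selfAttentionLayer and dlnetwork are available; I strip the trailing classificationLayer because dlnetwork does not accept output layers):

% Build the layer array above with NUM_HEADS / NUM_KEY_CHANNELS set for the case
% of interest, then count every learnable element in the initialized network.
net = dlnetwork(layers(1:end-1));   % layers(1:end-1) drops classificationLayer
numLearnables = sum(cellfun(@numel, net.Learnables.Value))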
My understanding from research papers is that the total parameterization of the Q/K/V projections should scale with the total key dimension, not with how that dimension is split across heads.
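Here is the back-of-the-envelope count behind that expectation (my own arithmetic using the standard multi-head formulation from the papers, not necessarily how MATLAB interprets NumKeyChannels; the input to the attention layer is the flattened 7×7×64 = 3136-channel feature map):

dModel = 7*7*64;                   % channels entering selfAttentionLayer after flattenLayer
dKey   = [4*784, 8*392];           % assumed total key dimension, case 1 and case 2 (both 3136)
qkv    = 3*(dModel.*dKey + dKey);  % Q, K, V projection weights plus biases
proj   = dKey.*dModel + dModel;    % output projection back to dModel channels, plus bias
qkv + proj                         % identical for both cases under this reading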
Why does MATLAB’s selfAttentionLayer produce different parameter counts for these two configurations? Am I misinterpreting how the layer is implemented in this toolbox?
I’d appreciate any clarification on how MATLAB calculates the number of learnable parameters here, especially since this doesn’t seem to be spelled out in the Deep Learning Toolbox documentation.