Paper summary: Attention is all you need

This summary is part of the series "Advent of LLM history". Throughout this month I will add daily short summaries of important and influential papers, highlighting their key findings, and along the way build an Anki set to remember the most important facts. See all posts here

Summary of "Attention is all you need"

This post summarizes the key points from one of the most famous and frequently cited papers, the one introducing the Transformer architecture: Attention is all you need [1].

Link to the paper: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Architecture

The proposed Transformer architecture consists of two parts:

Encoder

  • stack of $N=6$ identical layers
    • each layer has two sub-layers with residual connections around them
      • a multi-head self-attention
      • fully connected feed-forward network
    • after each sub-layer, layer normalization is applied (post-norm; the wiring is sketched after this list)
    • at every step the same dimension $d_{\text{model}}=512$ is used
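
A minimal NumPy sketch of this sub-layer wiring (my own illustration, not the paper's code; the learnable LayerNorm gain and bias are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector to zero mean / unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # post-norm: residual add first, then LayerNorm, for both sub-layers
    x = layer_norm(x + self_attention(x))  # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))    # sub-layer 2: position-wise FFN
    return x
```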

Decoder

  • stack of $N=6$ identical layers
    • each layer has the same two sub-layers as the encoder block
    • a third sub-layer (cross-attention) is added in the middle, performing MHA over the output of the encoder stack (queries from the previous decoder block, keys and values from the encoder output)
    • the self-attention sub-layer is masked to prevent positions from attending to subsequent positions (a mask sketch follows below)
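
One common way to implement this masking (my own sketch, not taken from the paper's code) is to add $-\infty$ to the attention logits of all future positions before the softmax:

```python
import numpy as np

def causal_mask(n):
    # entry (i, j) is -inf for j > i, so position i cannot attend to later positions
    return np.triu(np.full((n, n), -np.inf), k=1)

print(causal_mask(4))  # 0 on and below the diagonal, -inf above it
```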

Scaled Dot-Product attention

  • Attention takes queries, keys, and values and calculates an output vector as a weighted sum of the values (a code sketch follows after this list)
  • while queries and keys must have the same dimension $d_k$, values can have a different dimension $d_v$
  • the dot-product of queries $Q$ and keys $K^\top$ is scaled by $1/\sqrt{d_k}$
  • Why? To stabilize gradients: for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
    • Great illustration: if the components of $q$ and $k$ are independent with mean $0$ and variance $1$, their dot product $q \cdot k = \sum_i q_i k_i$ has mean $0$ and variance $d_k$
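
A small NumPy sketch of scaled dot-product attention as described above (my own illustration, not the paper's code; single sequence, no batch dimension):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity logits
    if mask is not None:
        scores = scores + mask           # e.g. the causal mask from the decoder sketch
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # (n_q, d_v): weighted sum of the values

# toy usage with the paper's per-head dimensions
n, d_k, d_v = 5, 64, 64
out = scaled_dot_product_attention(np.random.randn(n, d_k),
                                   np.random.randn(n, d_k),
                                   np.random.randn(n, d_v))
print(out.shape)  # (5, 64)
```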

Multi-Head Attention

  • instead of a single attention operation, there are $h$ parallel heads ($h=8$ in the base model), each with separately learned projections $W_i^Q$, $W_i^K$ and $W_i^V$; the concatenated head outputs are mixed by a final output projection $W^O$
  • Each head works with dimension $d_{\text{model}}/h$ (64 in the base model), which keeps the overall computational cost roughly independent of the number of heads $h$ (see the sketch below)
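
A minimal sketch of multi-head attention (my own illustration), reusing the `scaled_dot_product_attention` function from the previous block; slicing one big projection per head is equivalent to giving each head its own smaller $W_i^Q$, $W_i^K$, $W_i^V$:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    # X: (n, d_model); all projection matrices: (d_model, d_model)
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)  # the columns belonging to head i
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # concatenate all heads and mix them with the output projection W_O
    return np.concatenate(heads, axis=-1) @ W_O
```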

Position-wise Feed-Forward Networks

  • simple two-layer MLP with ReLU activation in between
  • Consists of an up-projection to $d_{ff}=2048$ followed by a down-projection back to $d_{\text{model}}=512$ (sketched below)
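
As a quick sketch (my own; shapes as described above), the FFN is applied identically and independently at every position:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    h = np.maximum(0.0, x @ W1 + b1)  # up-projection to d_ff=2048 plus ReLU
    return h @ W2 + b2                # down-projection back to d_model=512
```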

Embedding and Output Linear Transformation

  • they share the same weight matrix between the two embedding layers and the pre-softmax linear transformation (weight tying)
  • Crucial detail: in the embedding layers, the weights are multiplied by $\sqrt{d_{\text{model}}}$ (see the sketch below)
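
A minimal sketch of weight tying together with the embedding scaling (my own illustration; the toy vocabulary size and random initialization are assumptions, the paper's shared vocabulary has roughly 37,000 tokens):

```python
import numpy as np

vocab_size, d_model = 1000, 512           # toy vocab; the paper uses ~37,000
E = np.random.randn(vocab_size, d_model)  # one shared weight matrix

def embed(token_ids):
    # embedding lookup, scaled by sqrt(d_model) as described in the paper
    return E[np.asarray(token_ids)] * np.sqrt(d_model)

def lm_logits(hidden):
    # the pre-softmax linear transformation reuses the same matrix (weight tying)
    return hidden @ E.T
```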

Positional encoding

  • Since self-attention is permutation-equivariant (shuffling the input tokens would merely shuffle the output tokens in the same way), positional encodings are required to inject information about token order
  • They use sine and cosine functions of different frequencies (a code sketch follows after this list):
    • $PE(pos, 2i)=\sin (pos / 10000^{2i/d_{model}})$
    • $PE(pos, 2i+1)=\cos (pos / 10000^{2i/d_{model}})$
    • where $pos$ is position and $i$ is dimension
  • Even though learned positional embeddings perform similarly well, they hypothesize that sinusoidal encodings allow the model to extrapolate to sequence lengths longer than those seen during training
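
A small NumPy sketch of the sinusoidal encoding defined above (my own illustration; `max_len` is an arbitrary choice):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```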

Why self-attention?

  • A recurrent layer needs $O(n)$ sequential operations and $O(n \cdot d^2)$ total operations per layer to connect all positions, while self-attention needs only $O(1)$ sequential operations and $O(n^2 \cdot d)$ total operations; in total compute, self-attention is therefore cheaper as long as the sequence length $n$ is smaller than the representation dimension $d$ (usually the case at that time, but nowadays context windows often exceed model dimensions); a quick numeric check follows after this list
  • side benefit is better interpretability due to attention scores
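
A toy back-of-the-envelope check of the per-layer operation counts from the paper's complexity table ($n^2 \cdot d$ for self-attention vs. $n \cdot d^2$ for a recurrent layer); the example lengths are my own:

```python
def cheaper_layer(n, d):
    # compares only the leading complexity terms, ignoring constants
    return "self-attention" if n * n * d < n * d * d else "recurrent"

print(cheaper_layer(n=100, d=512))   # self-attention (n < d, typical 2017 MT sentence)
print(cheaper_layer(n=8192, d=512))  # recurrent (modern long-context regime, n > d)
```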

Training

  • WMT 2014 English–German (4.5M sentence pairs) and English–French (36M sentence pairs)
  • vocab size of 37,000 tokens
  • batch size of 25k tokens
  • training for 100k steps with 4k warm-up steps and a learning rate that decays proportionally to the inverse square root of the step number (schedule sketched after this list)
  • dropout for output of each sub-layer, before it is added to the residual input and normalized
  • additional dropout on sums of the embeddings and positional encodings in both encoder and decoder stacks
  • label smoothing hurts perplexity (since it increases uncertainty), but increases accuracy and BLEU score
  • for the base models they average the last 5 checkpoints (written at 10-minute intervals); the big models average the last 20 checkpoints
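
A sketch of this learning-rate schedule as I recall it from the paper (linear warm-up, then inverse-square-root decay, both scaled by $d_{\text{model}}^{-0.5}$); treat the exact formula as my reconstruction:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # linear warm-up for the first warmup_steps, then decay ~ 1/sqrt(step)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 100_000):
    print(s, transformer_lr(s))  # peaks around 7e-4 at step 4000
```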

What I learned

  • weight tying was already proposed in the original Transformer paper (together with scaling the embedding weights by $\sqrt{d_\text{model}}$)
  • the reason for the factor $1/\sqrt{d_k}$ in the attention calculation is to avoid close-to-zero gradients due to a saturated softmax
  • When the sequence length exceeds the model dimension, self-attention becomes more expensive than a recurrent layer
  • label smoothing improves accuracy and BLEU, but hurts perplexity
  • they use checkpoint averaging

Anki cards

Below are some of the key facts I want to remember in Anki format.

Question: Where is normalization positioned in the "Attention is all you need" paper?

Answer Post-norm, after each sub-layer

Question: What is the model dimension $d_\text{model}$ in the Transformer of "Attention is all you need"?

Answer 512

Question: "Attention is all you need": What are the sub-layers of an encoder block?

Answer Self-attention (MHA) and FFN

Question: "Attention is all you need": What are the sub-layers of a decoder block?

Answer Masked-MHA, Cross-Attention and FFN

Question: "Attention is all you need": Where do queries, keys and values come from in a cross-attention block?

Answer `queries` from previous decoder block, `keys` and `values` from last encoder output

Question: "Attention is all you need": Which of the query, key and value vectors must have the same dimension, and which can differ?

Answer `query` and `key` must share the same dimension $d_k$, `value` can have a different dimension $d_v$

Question: "Attention is all you need": Why do we need the factor of $1/\sqrt{d_k}$ in the attention calculation?

Answer Without it, the softmax saturates and most gradients get close to zero for larger $d_k$

Question: "Attention is all you need": What is the illustration given for the reason behind $1/\sqrt{d_k}$?

Answer If the components of $q$ and $k$ have mean $0$ and variance $1$, their dot product has variance $d_k$

Question: "Attention is all you need": What are the dimensions in the FFN ($d_{\text{model}}$ and $d_{ff}$)?

Answer $d_{\text{model}}=512$ and $d_{ff}=2048$

Question: "Attention is all you need": What is special about embedding and lm_head matrices?

Answer They share one weight matrix; in the embedding layers it is multiplied by $\sqrt{d_{\text{model}}}$

Question: "Attention is all you need": Why do the authors prefer sinusoidal positional encoding, even though a learned one performs similarly well?

Answer Sinusoidal might extrapolate

Question: "Attention is all you need": Under what condition is a self-attention layer computationally faster than a recurrent layer?

Answer When the sequence length $n$ is smaller than the representation dimension $d$. ($n < d$)

Question: "Attention is all you need": Label smoothing hurts one metric, but improves another. Which ones?

Answer Hurts perplexity, but improves accuracy and BLEU

Question: "Attention is all you need": How is the final model after training constructed?

Answer Average of last 5 (20 for big models) checkpoints

Question: "Attention is all you need": How long do their base models train (in GPU hours)?

Answer 96 (12h on 8 P100)

References

  1. Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017.