Paper summary: Attention is all you need
This summary is part of the series "Advent of LLM history". Throughout this month I will add daily short summaries of important and influential papers, highlighting their key findings, and along the way build an Anki set to remember the most important facts. See all posts here
Summary of "Attention is all you need"¶
This post summarizes the key points [^1] from one of the most famous and frequently cited papers, the one that introduced the Transformer architecture: Attention is all you need [1]
Link to the paper: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Architecture¶
The proposed Transformer architecture consists of two parts:
Encoder¶
- stack of $N=6$ identical layers
- each layer has two sub-layers with residual connections around them
- a multi-head self-attention
- fully connected feed-forward network
- after each sub-layer, layer normalization is applied (post-norm; see the sketch below)
- at every step the same dimension $d_{\text{model}}=512$ is used
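A minimal sketch of this post-norm residual pattern (function names are mine, not from the paper; the learned layer-norm gain and bias are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Post-norm as described in the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))
```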
Decoder¶
- stack of $N=6$ identical layers
- each layer has the same two sub-layers as the encoder block
- a third sub-layer in the middle performs cross-attention: MHA over the output of the encoder stack (queries come from the previous decoder sub-layer, keys and values from the encoder output)
- self-attention block is modified to prevent positions from attending to subsequent positions
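This masking is typically implemented by setting the attention logits of future positions to $-\infty$ before the softmax, so they receive zero weight; a rough sketch (my illustration, not the authors' code):

```python
import numpy as np

def causal_mask(n):
    # True above the diagonal marks the "future" positions that must not be attended to.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def mask_logits(scores, mask):
    # Masked logits become -inf, so the softmax assigns them zero attention weight.
    return np.where(mask, -np.inf, scores)
```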
Scaled Dot-Product attention¶
- Attention takes queries, keys, and values and calculates an output vector as a weighted sum of values
- while queries and keys must have the same dimension $d_k$, values can have a different dimension $d_v$
- the dot-product $QK^\top$ of queries and keys is scaled by $1/\sqrt{d_k}$
- Why? To stabilize gradients: for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
- Great illustration: if the components of $q$ and $k$ are independent with mean $0$ and variance $1$, their dot product has mean $0$ and variance $d_k$
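A NumPy sketch of the whole operation, $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ (my own illustration, not code from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaling keeps the logits from growing with d_k
    weights = softmax(scores, axis=-1)   # each query's weights over the keys sum to 1
    return weights @ V                   # output: (n_q, d_v)
```

As a quick empirical check of the variance argument: with `Q = np.random.randn(1000, 512)` and `K = np.random.randn(1000, 512)`, the entries of `Q @ K.T` have a sample variance close to 512.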
Multi-Head Attention¶
- instead of having a single attention operation, there exist $h$ different heads, each with separately learned projections $W^Q$, $W^K$ and $W^V$; the concatenated head outputs are projected back with $W^O$
- Each head has dimension of $d_{\text{model}}/h$, which keeps overall complexity independent from number of heads $h$
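A sketch of the head bookkeeping, reusing the `scaled_dot_product_attention` sketch from above (the shapes are the interesting part; the weights here are random placeholders that would be learned in practice):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h  # 64 dimensions per head

# Per-head projections (d_model -> d_k) and the output projection (h*d_k -> d_model).
W_Q = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_K = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_V = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_O = np.random.randn(h * d_k, d_model) * 0.02

def multi_head_attention(X):
    # X: (n, d_model). Each head attends in its own d_k-dimensional subspace.
    heads = [
        scaled_dot_product_attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i])
        for i in range(h)
    ]
    return np.concatenate(heads, axis=-1) @ W_O  # back to (n, d_model)
```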
Position-wise Feed-Forward Networks¶
- simple two-layer MLP with ReLU activation in between
- Consists of an up-scaling to $d_{ff}=2048$ and down-scaling back to $d_{\text{model}}=512$
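A sketch of the position-wise FFN under these dimensions (weights shown as random placeholders, my own code):

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def position_wise_ffn(x):
    # x: (n, d_model); the same two linear maps are applied at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU in between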
Embedding and Output Linear Transformation¶
- they share the same weight matrix between the two embedding layers and the pre-softmax linear transformation (weight tying)
- Crucial detail: In the embedding layers, the weights are multiplied by $\sqrt{d_{\text{model}}}$
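A rough sketch of the tying and the $\sqrt{d_{\text{model}}}$ factor (variable names and initialization are mine):

```python
import numpy as np

d_model, vocab_size = 512, 37000
E = np.random.randn(vocab_size, d_model) * 0.01  # the single shared weight matrix

def embed(token_ids):
    # Embedding lookup, scaled by sqrt(d_model) as the paper specifies.
    return E[token_ids] * np.sqrt(d_model)

def output_logits(hidden):
    # The pre-softmax linear transformation reuses the same matrix (weight tying).
    return hidden @ E.T
```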
Positional encoding¶
- Since attention itself is a set operation (shuffling the input tokens merely shuffles the corresponding outputs), positional encodings are required to inject order information
- They use sine and cosine functions of different frequencies:
- $PE(pos, 2i)=\sin (pos / 10000^{2i/d_{model}})$
- $PE(pos, 2i+1)=\cos (pos / 10000^{2i/d_{model}})$
- where $pos$ is position and $i$ is dimension
- Even though learned positional embeddings perform nearly identically, the authors hypothesize that sinusoidal encodings may allow the model to extrapolate to sequence lengths longer than those seen during training
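A sketch of this encoding (even dimensions get the sine, odd dimensions the cosine; my own code, not the authors'):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)  # PE(pos, 2i+1)
    return pe
```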
Why self-attention?¶
- While recurrent layers require $O(n)$ sequential operations to connect all positions in a sequence, self-attention does so in a constant number of sequential operations; per layer it is also cheaper as long as the sequence length $n$ is smaller than the representation dimension $d$ (usually the case at that time, though nowadays context windows often exceed model dimensions); see the worked comparison below
- side benefit is better interpretability due to attention scores
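To make the per-layer comparison concrete (the numbers are my own illustration): self-attention costs on the order of $n^2 \cdot d$ operations, a recurrent layer on the order of $n \cdot d^2$. With $n=70$ and $d=512$ that is roughly $2.5 \cdot 10^6$ vs. $1.8 \cdot 10^7$ operations, so self-attention is cheaper; at $n=8000$ the comparison flips to roughly $3.3 \cdot 10^{10}$ vs. $2.1 \cdot 10^9$.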
Training¶
- WMT 2014 English–German (4.5M sentence pairs) and English–French (36M sentence pairs)
- shared vocabulary of about 37,000 BPE tokens for English–German (32,000 word-pieces for English–French)
- batches of roughly 25k source and 25k target tokens
- training for 100k steps with 4k warm-up steps and decay proportional to the inverse square root of the step number (see the schedule sketch after this list)
- dropout for output of each sub-layer, before it is added to the residual input and normalized
- additional dropout on sums of the embeddings and positional encodings in both encoder and decoder stacks
- label smoothing hurts perplexity (since it increases uncertainty), but increases accuracy and BLEU score
- for base models they average the last 5 checkpoints (written at 10-minute intervals); for big models, the last 20 checkpoints
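The learning-rate schedule behind that warm-up/decay description, as given in the paper, in a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    # linear warm-up for the first warmup_steps, then inverse-square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```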
What I learned¶
- weight tying was already proposed in the original Transformer paper (together with scaling the embedding weights by $\sqrt{d_{\text{model}}}$)
- the reason for the factor $1/\sqrt{d_k}$ in the attention calculation is to avoid near-zero gradients from a saturated softmax
- When the sequence length exceeds the model dimension, self-attention becomes more expensive than a recurrent layer
- label smoothing improves accuracy and BLEU, but hurts perplexity
- they use checkpoint averaging
Anki cards¶
Below are some of the key facts I want to remember in Anki format.
Question: Where is normalization positioned in the "Attention is all you need" paper?
Answer
Post-norm, after each sub-layer
Question: What is the model dimension $d_\text{model}$ in the Transformer of "Attention is all you need"?
Answer
512
Question: "Attention is all you need": What are the sub-layers of an encoder block?
Answer
Self-attention (MHA) and FFN
Question: "Attention is all you need": What are the sub-layers of a decoder block?
Answer
Masked-MHA, Cross-Attention and FFN
Question: "Attention is all you need": Where do queries, keys and values come from in a cross-attention block?
Answer
`queries` from previous decoder block, `keys` and `values` from last encoder output
Question: "Attention is all you need": Which vectors have to have the same dimension, and which can differ (query, key and value)?
Answer
`query` and `key` have to be the same, `value` can be different
Question: "Attention is all you need": Why do we need the factor of $1/\sqrt{d_k}$ in the attention calculation?
Answer
Without it, most gradients would get close to zero for larger $d_k$
Question: "Attention is all you need": What is the intuition behind the factor $1/\sqrt{d_k}$?
Answer
If the components of $q$ and $k$ have mean $0$ and variance $1$, their dot product has variance $d_k$
Question: "Attention is all you need": What are the dimensions in the FFN ($d_{\text{model}}$ and $d_{ff}$)?
Answer
$d_{\text{model}}=512$ and $d_{ff}=2048$
Question: "Attention is all you need": What is special about embedding and lm_head matrices?
Answer
They are shared, but the embedding matrix is scaled by $\sqrt{d_{\text{model}}}$
Question: "Attention is all you need": Why do the authors prefer sinusoidal positional encoding, even though a learned one performs similarly well?
Answer
Sinusoidal might extrapolate
Question: "Attention is all you need": Under what condition is a self-attention layer computationally faster than a recurrent layer?
Answer
When the sequence length $n$ is smaller than the representation dimension $d$ ($n < d$)
Question: "Attention is all you need": Label smoothing hurts one metric, but improves another. Which ones?
Answer
Hurts perplexity, but improves accuracy and BLEU
Question: "Attention is all you need": How is the final model after training constructed?
Answer
Average of last 5 (20 for big models) checkpoints
Question: "Attention is all you need": How long do their base models train (in GPU hours)?
Answer
96 GPU hours (12 hours on 8 P100 GPUs)
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 2017. ↩