Paper summary: BERT

This summary is part of the series "Advent of LLM history". Throughout this month I will add daily short summaries of important and influential papers, highlighting their key findings, and along the way build an Anki set to remember the most important facts. See all posts here

Summary of "BERT: Pre-training of deep bidirectional transformers for language understanding"

After yesterday's start of this series with the original Transformer paper "Attention is all you need"[1], we continue today with basically the same architecture, but with bidirectional attention: BERT[2]. Find out what the benefits are, how to pre-train a bidirectional model, and why this paper was so impactful.

Link to the paper: https://aclanthology.org/N19-1423.pdf

Introduction

  • most other methods use unidirectional attention, which limits their capabilities on tasks such as question answering
  • in this paper, the authors show that a bidirectional model is stronger and achieves state-of-the-art results on many tasks
  • fun fact: one of the models they compare to is named "ELMo"[3]
  • pre-training serves as a powerful base for many downstream tasks (and only requires a small amount of fine-tuning)

BERT

Architecture

  • Standard Transformer encoder (from Attention is all you need) in two sizes:
    • BERT-Base: 12 blocks, model dimension 768 and 12 heads (total parameters: 110M)
    • BERT-Large: 24 blocks, model dimension 1024 and 16 heads (total parameters: 340M)
  • Vocab size of 30k (WordPiece)
  • Add a [CLS] token as the first token, separate the two sentences with a [SEP] token, and add segment embeddings A and B to each token of the corresponding sentence (see the input-embedding sketch below)
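
To make the input construction concrete, here is a minimal PyTorch sketch of how token, segment and position embeddings are combined (my own illustration, not the original code; the class name and example ids are made up, and details such as dropout are omitted):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of WordPiece token, segment (A/B) and learned position embeddings,
    followed by layer normalization."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # ~30k WordPiece vocabulary
        self.segment = nn.Embedding(2, hidden)           # sentence A -> 0, sentence B -> 1
        self.position = nn.Embedding(max_len, hidden)    # learned position embeddings
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)

# hypothetical WordPiece ids for "[CLS] sentence A ... [SEP] sentence B ... [SEP]"
token_ids = torch.tensor([[101, 2023, 2003, 102, 2178, 6251, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
embeddings = BertInputEmbeddings()(token_ids, segment_ids)  # shape (1, 7, 768)
```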

Pre-training

  • Masked LM (MLM): Select 15% of tokens as prediction targets; of these, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged (see the masking sketch after this list)
    • This mitigates the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not)
  • Next Sentence Prediction (NSP): From a large corpus of text, choose sentences A and B such that 50% of the time B is the sentence directly following A and 50% of the time B is a random sentence from the corpus
    • [CLS] is used for this prediction
  • Data: 800M words from BooksCorpus and 2.5B words from English Wikipedia
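
The 80-10-10 masking rule fits in a few lines. Below is a simplified sketch (my own code, not the paper's: the function name and the -100 "ignore" label are my conventions, and special tokens such as [CLS]/[SEP] would additionally be excluded from masking in practice):

```python
import torch

def mask_for_mlm(token_ids, vocab_size, mask_id, mlm_prob=0.15):
    """Apply the MLM masking policy and return (masked inputs, labels)."""
    inputs = token_ids.clone()
    labels = token_ids.clone()

    # select 15% of positions as prediction targets
    targets = torch.rand(inputs.shape) < mlm_prob
    labels[~targets] = -100                       # loss is computed on targets only

    # 80% of the targets are replaced with [MASK]
    masked = targets & (torch.rand(inputs.shape) < 0.8)
    inputs[masked] = mask_id

    # half of the remaining targets (10% overall) get a random token
    randomized = targets & ~masked & (torch.rand(inputs.shape) < 0.5)
    inputs[randomized] = torch.randint(vocab_size, inputs.shape)[randomized]

    # the final 10% of targets keep their original token
    return inputs, labels
```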

Fine-tuning

  • use the sentence A/B inputs and the [CLS] output according to the task
  • add a single task-specific layer on top of the pre-trained BERT outputs (see the sketch after this list)
  • very cheap (at most 1 hour on a single Cloud TPU, or a few hours on a GPU)
  • uses labeled (supervised) data for the downstream task
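
A sketch of what "add one layer" means in practice, assuming a pre-trained encoder module that returns hidden states of shape (batch, seq_len, hidden); the class name and interface are hypothetical, not the paper's code:

```python
import torch.nn as nn

class BertWithClassificationHead(nn.Module):
    """One new linear layer on top of the encoder's [CLS] output; encoder
    and head are fine-tuned together on the labeled downstream data."""

    def __init__(self, pretrained_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder                # pre-trained BERT
        self.classifier = nn.Linear(hidden, num_labels)  # the only new parameters

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]                       # output at the [CLS] position
        return self.classifier(cls_vector)                     # task logits
```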

Results

  • clearly outperforms GPT on many tasks, despite having an almost identical architecture (the only difference being the attention masking)
  • new "tokens" can be introduced
    • for SQuAD (QA) a start [S] and end [E] vector are introduced: the dot product of these vectors with the BERT output vectors is the score for the position being start or end of answer sequence
    • for "no answer", both [S] and [E] are set to be at the class token

Ablations

  • Next Sentence Prediction is important for sentence-level tasks
  • left-to-right-only attention hurts performance; combining a left-to-right and a right-to-left model is undesirable because it doubles the cost, is counter-intuitive for QA, and is strictly less powerful than a deeply bidirectional model
  • first to show that a larger model improves performance even on small downstream datasets
  • BERT features can also be used in a feature-based approach without a large performance decrease, which is useful, e.g., when precomputing features is computationally cheaper (see the sketch after this list)
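
For the feature-based variant, a rough sketch looks like the following, assuming an encoder that exposes its per-layer outputs (the function and its interface are hypothetical):

```python
import torch

@torch.no_grad()
def extract_features(encoder, token_ids, segment_ids):
    """Run the frozen pre-trained encoder once and reuse its hidden states as
    contextual embeddings for a separate task model. Assumes `encoder` returns
    a list of per-layer outputs, each of shape (batch, seq_len, hidden)."""
    layer_outputs = encoder(token_ids, segment_ids)
    # the paper's NER ablation found concatenating the last four layers works
    # best (about 0.3 F1 behind full fine-tuning)
    return torch.cat(layer_outputs[-4:], dim=-1)  # (batch, seq_len, 4 * hidden)
```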

Anki

Question: How many parameters does BERT-Base have?

Answer: 110M

Question: How many parameters does BERT-Large have?

Answer: 340M

Question: What is the difference in architecture between BERT and GPT?

Answer: BERT uses bidirectional attention, whereas GPT uses causal attention

Question: BERT: What are the pre-training tasks?

Answer: Masked LM (MLM) and Next Sentence Prediction (NSP)

Question: BERT: What portion of tokens is masked during Masked LM pre-training?

Answer: 15% (following the 80-10-10 rule)

Question: BERT: What is the specific token replacement policy for the 15% masked tokens?

Answer: 80% replaced with `[MASK]`, 10% with a random token, 10% left unchanged.

Question: BERT: Why is the 80-10-10 rule important in Masked LM pre-training?

Answer: To mitigate the mismatch between pre-training (`[MASK]` tokens present) and fine-tuning (no `[MASK]` tokens)

Question: BERT: Is BERT mainly a "feature-based" or a "fine-tuning-based" approach?

Answer: Fine-tuning based

Question: BERT: What is an advantage of the "fine-tuning-based" approach?

Answer: Only very few additional parameters are needed for each downstream task; the pre-trained model is reused

Question: BERT: What is an advantage of the "feature-based" approach?

Answer: Features can be precomputed

References

  1. Attention Is All You Need
    Advances in Neural Information Processing Systems, 2017
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019
  3. Deep Contextualized Word Representations
    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018