Paper summary: BERT
This summary is part of the series "Advent of LLM history". Throughout this month I will add daily short summaries of important and influential papers, highlighting their key findings, and along the way build an Anki set to remember the most important facts. See all posts here
Summary of "BERT: Pre-training of deep bidirectional transformers for language understanding"¶
After yesterday's start of this series with the original Transformer paper "Attention is all you need"[1], we continue today with essentially the same architecture, but with bidirectional attention: BERT[2]. Find out what the benefits are, how to pre-train a bidirectional model and why this paper was so impactful.
Link to the paper: https://aclanthology.org/N19-1423.pdf
Introduction¶
- most other methods use unidirectional attention, which limits their capabilities for tasks such as question answering
- in this paper, the authors show that a bidirectional model is stronger and achieves state-of-the-art results on many tasks
- fun fact: one of the models they compare to is named "ELMo"[3]
- pre-training serves as a powerful base for many downstream tasks (and only requires a lightweight fine-tuning step)
BERT¶
Architecture¶
- Standard Transformer encoder (from Attention is all you need) in two sizes:
- BERT-Base: 12 blocks, model dimension 768 and 12 heads (total parameters: 110M)
- BERT-Large: 24 blocks, model dimension 1024 and 16 heads (total parameters: 340M)
- Vocab size of 30k (WordPiece)
- Add a `[CLS]` token as the first token; add segment embeddings `A` and `B` (added to each token of the corresponding sentence); a minimal sketch of the resulting input representation follows below
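To make the input construction concrete, here is a minimal PyTorch sketch of the embedding sum (token + segment + position). The hyperparameters follow BERT-Base and the paper's 512-token maximum length; the LayerNorm after the sum follows the common implementation rather than anything stated above, and the class and variable names are made up for illustration.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch of BERT's input representation: sum of token, segment and
    (learned) position embeddings, with BERT-Base hyperparameters."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)   # segment A -> 0, segment B -> 1
        self.pos = nn.Embedding(max_len, hidden)      # learned absolute positions
        self.norm = nn.LayerNorm(hidden)              # assumption: LayerNorm after the sum

    def forward(self, token_ids, segment_ids):        # both: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.norm(x)                           # (batch, seq_len, hidden)
```

The input sequence is `[CLS] tokens of A [SEP] tokens of B [SEP]`, with segment id 0 for everything up to and including the first `[SEP]` and segment id 1 afterwards.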
Pre-training¶
- Masked LM (MLM): select 15% of tokens as prediction targets; of these, replace 80% with `[MASK]`, 10% with a random token and leave 10% unchanged (a minimal masking sketch follows after this list)
  - This mitigates the shift between pre-training (`[MASK]` present) and fine-tuning (no `[MASK]`)
- Next Sentence Prediction (NSP): from a large corpus of text, choose sentences `A` and `B` such that 50% of the time `B` is the sentence directly following `A`, and 50% of the time `B` is a random sentence from the corpus
  - The output at `[CLS]` is used for this prediction
- Data: 800M words from BooksCorpus and 2.5B words from English Wikipedia
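To make the 80-10-10 rule concrete, here is a small sketch of the MLM corruption step. The special-token ids and the use of -100 as an ignored-label marker are illustrative assumptions, not values from the paper.

```python
import random

# Hypothetical special-token ids, for illustration only.
MASK_ID, CLS_ID, SEP_ID = 103, 101, 102
VOCAB_SIZE = 30000

def mask_for_mlm(token_ids):
    """Apply BERT's MLM corruption: pick 15% of tokens as prediction
    targets; of those, replace 80% with [MASK], 10% with a random token
    and leave 10% unchanged. Returns (corrupted_ids, labels), where
    labels is -100 (ignored) for non-target positions."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in (CLS_ID, SEP_ID):              # never mask special tokens
            continue
        if random.random() < 0.15:               # 15% become prediction targets
            labels[i] = tok                      # the model must predict the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_ID           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token unchanged
    return corrupted, labels
```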
Fine-tuning¶
- use `A`, `B` and `[CLS]` according to the task
- add one layer on top of the outputs from pre-trained BERT (see the classification sketch after this list)
- very cheap (at most one hour on a single Cloud TPU, or a few hours on a GPU)
- uses supervised data
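A minimal sketch of fine-tuning for a sentence-level classification task, assuming a pre-trained `bert_encoder` that returns one hidden vector per input token; the single linear layer on top of the `[CLS]` output is the only newly initialized part, and the names are made up for illustration.

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    """Fine-tuning head: one new linear layer on top of the [CLS] output;
    everything else comes from the pre-trained encoder."""
    def __init__(self, bert_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder                         # pre-trained BERT (assumed interface)
        self.classifier = nn.Linear(hidden, num_labels)  # the only newly initialized parameters

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]                    # [CLS] is the first position
        return self.classifier(cls_vector)                  # (batch, num_labels)
```

During fine-tuning, all parameters (the encoder plus the new layer) are updated jointly on the labeled task data.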
Results¶
- clearly outperforms GPT on many tasks, despite an almost identical architecture (the only difference being the attention masking)
- new task-specific "tokens" can be introduced:
  - for SQuAD (QA), a start vector `[S]` and an end vector `[E]` are introduced: the dot product of these vectors with the BERT output vectors gives the score for a position being the start or end of the answer span (see the sketch after this list)
  - for "no answer", both `[S]` and `[E]` are set to point at the class token `[CLS]`
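A minimal sketch of the SQuAD span head described above, assuming BERT output vectors of dimension 768; the class and attribute names are made up for illustration.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Start/end scoring for extractive QA: the score of position i is the
    dot product of the start (resp. end) vector with BERT's output at i."""
    def __init__(self, hidden=768):
        super().__init__()
        self.start_vec = nn.Parameter(torch.randn(hidden))  # "[S]" in the summary above
        self.end_vec = nn.Parameter(torch.randn(hidden))    # "[E]"

    def forward(self, hidden_states):                 # (batch, seq_len, hidden)
        start_scores = hidden_states @ self.start_vec  # (batch, seq_len)
        end_scores = hidden_states @ self.end_vec      # (batch, seq_len)
        return start_scores, end_scores
```

At prediction time, the span (i, j) with i ≤ j and the highest start_scores[i] + end_scores[j] is chosen; for the "no answer" case, the combined score at the `[CLS]` position serves as the baseline to beat.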
Ablations¶
- Next Sentence Prediction is important for sentence-level tasks
- Left-to-right-only attention hurts performance; concatenating separate left-to-right and right-to-left models is undesirable due to the doubled cost, its counter-intuitive fit for tasks like QA, and being strictly less powerful than deep bidirectional attention
- first work to show that scaling to larger model sizes improves performance even on tasks with small datasets
- BERT can also be used in a feature-based approach (frozen contextual features fed to a small task model) without a huge performance decrease, which is useful, e.g., when precomputing features is computationally cheaper
Anki¶
Question: How many parameters does BERT-Base have?
Answer: 110M

Question: How many parameters does BERT-Large have?
Answer: 340M

Question: What is the difference in architecture between BERT and GPT?
Answer: BERT has bidirectional attention, whereas GPT has causal attention

Question: BERT: What are the pre-training tasks?
Answer: Masked LM (MLM) and Next Sentence Prediction (NSP)

Question: BERT: What portion of tokens is masked during Masked LM pre-training?
Answer: 15% (with the 80-10-10 rule)

Question: BERT: What is the specific token replacement policy for the 15% masked tokens?
Answer: 80% replaced with `[MASK]`, 10% with a random token, 10% unchanged

Question: BERT: Why is the 80-10-10 rule important in Masked LM pre-training?
Answer: To mitigate the shift between pre-training (`[MASK]` tokens) and fine-tuning (no `[MASK]` tokens)

Question: BERT: Is BERT mainly a "feature-based" or a "fine-tuning-based" approach?
Answer: Fine-tuning based

Question: BERT: What is an advantage of a "fine-tuning-based" approach?
Answer: Only very few additional parameters per downstream task, reusing the pre-trained model

Question: BERT: What is an advantage of a "feature-based" approach?
Answer: Features can be precomputed

References
1. Vaswani et al., "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017
2. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019
3. Peters et al., "Deep Contextualized Word Representations", Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018