LoRA initialization
Showing that $0\neq 0$
Let's get bidirectional
The original Transformer paper
After yesterday's look at BERT [@devlin2019bert] and the benefits of bidirectional attention, we are covering a different direction today. While previous papers focused on _how_ to build the model (arc…
A closer look at weight tying and its effects on token embeddings
Step-by-step exploration of nanochat's model
Close examination of DPO and SFT gradients