LoRA initialization
Showing that $0\neq 0$
Let's get bidirectional
The original Transformer paper
After yesterday's look at BERT [@devlin2019bert] and the benefits of bidirectional attention, we are covering a different direction today. While previous papers focused on _how_ to build the model (arc…
A closer look at weight tying and its effects on token embeddings
Step-by-step exploration of nanochat's model
Close examination of DPO and SFT gradients