DPO is SFT

September 07, 2025 · 8 min read

DPO is SFT (with relative sequence weighting)

updated May 06, 2026

Training LLMs usually involves multiple stages. One of them is supervised fine-tuning (SFT), where the model is trained with cross-entropy loss on prompt-completion pairs. Another common stage is preference alignment, where the model is trained on ranked completions. Direct Preference Optimization (DPO) [1] is one of the simplest and most widely used methods for this.

At first glance, SFT and DPO look quite different. SFT trains on one target completion. DPO trains on a pair of completions: one preferred and one dispreferred. SFT uses token-level cross-entropy. DPO uses a sequence-level preference loss.

Their gradients, however, turn out to be closely related. Let's start with a higher-level overview of both methods:

SFT with cross-entropy loss

What would deep learning be without cross-entropy loss? It is not only a well-proven loss function that can be implemented efficiently and stably, but it is also deeply rooted in probability theory. I won't start from the concept of entropy here, even though it is quite interesting (see this article for more information).

In LLM training, SFT is usually employed after pretraining (which also uses cross-entropy loss) to teach the model the chat template, instruction following, and some general concepts and behaviors. For this post, we simplify the dataset to single-turn prompt-answer pairs:

$$ \mathcal{D}_\text{SFT} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})\} $$

where $x^{(i)}$ is a prompt (e.g. question or instruction) and $y^{(i)}$ is the expected answer.

During training, the model predicts each answer token conditioned on the prompt and all previous answer tokens:

$$ \pi_\theta(y_i \mid x, y_{< i}) $$

The cross-entropy loss, together with the optimizer, then pushes the predicted probability distribution over all tokens of the vocabulary towards the one-hot encoded ground truth. We can see this by looking at the loss and its gradient for one sequence.

For a single sequence $y$ of length $N$, the token-averaged cross-entropy loss is:

$$ \begin{aligned} \mathcal{L}_\text{CE} &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{i,j} \log(\hat{y}_{i,j}) \\ &= -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{i,t_i}) \end{aligned} $$

where:

$N$ is the sequence length
$V$ is the vocabulary size
$y_{i,j}$ is the target probability for token $j$ at position $i$
$\hat{y}_{i,j}$ is the model probability for token $j$ at position $i$
$t_i$ is the true token at position $i$
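The two forms of the loss above can be checked against each other in a few lines. This is a minimal sketch in plain Python, assuming one-hot targets; the function names and the toy probabilities are illustrative and not taken from any library.

```python
import math

def cross_entropy_full(targets, probs):
    """Double sum over positions i and vocabulary entries j, averaged over N."""
    total = 0.0
    for target_row, prob_row in zip(targets, probs):
        for t, p in zip(target_row, prob_row):
            total += t * math.log(p)
    return -total / len(targets)

def cross_entropy_picked(true_tokens, probs):
    """Same loss, using only the probability of the true token t_i."""
    picked = [math.log(row[t]) for t, row in zip(true_tokens, probs)]
    return -sum(picked) / len(true_tokens)

# Hypothetical model probabilities for N = 2 positions and V = 3 tokens.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3]]
true_tokens = [0, 1]
targets = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]  # one-hot encoding of true_tokens

loss_full = cross_entropy_full(targets, probs)
loss_picked = cross_entropy_picked(true_tokens, probs)
```

With one-hot targets, every term except the true token vanishes, so both functions return the same value.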

For one token prediction, the gradient of cross-entropy with respect to the logits $z$ is:

$$ \frac{\partial \mathcal{L}_\text{CE}}{\partial z} = \hat{y} - e_y $$

where $e_y$ is the one-hot vector of the target token.

This is the important part. For the correct token, the gradient is $\hat{y}_k - 1$, which is negative. Under gradient descent, this increases the corresponding logit. For all incorrect tokens, the gradient is $\hat{y}_k$, which is positive, so gradient descent decreases those logits.

Cross-entropy therefore does exactly what we expect: it increases the model's logit for the target token and decreases the logits of the alternatives.
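The gradient formula can be verified numerically. The following sketch compares $\hat{y} - e_y$ with a central finite-difference estimate for a single token prediction; the logits and the helper names are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, target):
    """Cross-entropy for one token prediction."""
    return -math.log(softmax(logits)[target])

def ce_grad(logits, target):
    """Analytic gradient with respect to the logits: y_hat - e_y."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def numerical_grad(f, logits, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

logits = [1.0, -0.5, 0.3]  # hypothetical logits, V = 3
target = 0
analytic = ce_grad(logits, target)
numeric = numerical_grad(lambda z: ce_loss(z, target), logits)
```

The entry for the target token is negative and all other entries are positive, matching the sign argument above.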

Alignment with Direct Preference Optimization

After teaching the model basic templates, instruction following, and some general behaviors, an additional alignment stage is typically used to further improve the quality of predictions. DPO (like other alignment methods) makes it possible to teach hard-to-describe behaviors, output formats, language usage, and structure. I definitely recommend the paper, since it is well written and does a good job of motivating the method and explaining it simply.

DPO requires an already pre-trained and fine-tuned model, since the first step is to sample multiple completions for a set of prompts. For this post we assume just two completions, which are ranked and labelled as preferred (commonly referred to as winning) and dispreferred (losing):

$$ \mathcal{D}_\text{DPO} = \{(x^{(1)}, y_w^{(1)}, y_l^{(1)}), (x^{(2)}, y_w^{(2)}, y_l^{(2)}), \dots, (x^{(n)}, y_w^{(n)}, y_l^{(n)})\} $$

During training, both sequence probabilities are computed, and the objective of the loss is to increase the likelihood of the preferred completion while simultaneously decreasing the probability of the dispreferred one.

For a sequence $y$, the sequence probability is the product of token probabilities:

$$ \pi_\theta(y \mid x) = \prod_{i=1}^{N} \pi_\theta(y_i \mid x, y_{< i}) $$

Since it is often more convenient to use log-probabilities:

$$ \log \pi_\theta(y \mid x) = \sum_{i=1}^{N} \log \pi_\theta(y_i \mid x, y_{< i}) $$

The DPO loss is:

$$ \mathcal{L}_\text{DPO} = - \mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)} {\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)} {\pi_\text{ref}(y_l \mid x)} \right) \right] $$

To simplify notation, we can define the implicit reward:

$$ r_\theta(y \mid x) = \beta \log \frac{\pi_\theta(y \mid x)} {\pi_\text{ref}(y \mid x)} $$

and the preference margin:

$$ \Delta = r_\theta(y_w \mid x) - r_\theta(y_l \mid x) $$

Then the single-example DPO loss becomes just:

$$ \mathcal{L}_\text{DPO} = - \log \sigma(\Delta) $$

DPO therefore rewards the model when the preferred completion has a higher implicit reward, i.e. a higher sequence probability relative to the reference model, than the dispreferred completion.
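As a rough sketch, the single-example loss can be computed directly from per-token log-probabilities. The code below assumes these log-probabilities are already available from the policy and the reference model; the function names, the toy numbers, and the choice of $\beta = 0.1$ are illustrative assumptions.

```python
import math

def sequence_logprob(token_logprobs):
    """log pi(y | x) as the sum of token log-probabilities."""
    return sum(token_logprobs)

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Single-example DPO loss and preference margin.

    Each argument is a list of token log-probabilities for one completion.
    """
    reward_w = beta * (sequence_logprob(policy_w) - sequence_logprob(ref_w))
    reward_l = beta * (sequence_logprob(policy_l) - sequence_logprob(ref_l))
    delta = reward_w - reward_l
    # -log(sigmoid(delta)) written as a numerically stable softplus(-delta)
    loss = max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))
    return loss, delta

# Hypothetical token log-probabilities under the reference model.
ref_w = [-0.9, -1.1, -0.7]
ref_l = [-0.8, -1.0]

# Policy identical to the reference: the margin is zero.
loss_at_ref, delta_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Policy that raised the preferred and lowered the dispreferred completion.
policy_w = [-0.5, -0.6, -0.4]
policy_l = [-1.5, -1.6]
loss_shifted, delta_shifted = dpo_loss(policy_w, policy_l, ref_w, ref_l)
```

When the policy equals the reference, the margin is $0$ and the loss is $\log 2$. Moving probability mass towards the preferred completion makes the margin positive and the loss smaller.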

Now, to see the connection between SFT and DPO, we need to look at the gradients of DPO as well. For one preference pair:

$$ \mathcal{L}_\text{DPO} = -\log \sigma(\Delta) $$

where:

$$ \Delta = r_\theta(y_w \mid x) - r_\theta(y_l \mid x) $$

and:

$$ r_\theta(y \mid x) = \beta \log \frac{\pi_\theta(y \mid x)} {\pi_\text{ref}(y \mid x)} $$

The derivative of the loss with respect to the preference margin is:

$$ \frac{\partial \mathcal{L}_\text{DPO}}{\partial \Delta} = -\sigma(-\Delta) $$
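For completeness, this follows from the chain rule together with the standard sigmoid identities $\sigma'(\Delta) = \sigma(\Delta)\,(1 - \sigma(\Delta))$ and $1 - \sigma(\Delta) = \sigma(-\Delta)$:

$$ \frac{\partial}{\partial \Delta} \left[ -\log \sigma(\Delta) \right] = -\frac{\sigma'(\Delta)}{\sigma(\Delta)} = -\frac{\sigma(\Delta)\,(1 - \sigma(\Delta))}{\sigma(\Delta)} = -(1 - \sigma(\Delta)) = -\sigma(-\Delta) $$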

So if the model already strongly prefers the winning completion, then $\Delta$ is large and positive, and the gradient magnitude is close to $0$. If the model prefers the losing completion, then $\Delta$ is negative, and the gradient magnitude approaches its maximum of $1$.

Now let us take a look at a token in the preferred sequence. The reference policy does not depend on the current model logits, so only $\log \pi_\theta(y_w \mid x)$ contributes to the logit gradient.

For the preferred sequence, we get:

$$ \frac{\partial \mathcal{L}_\text{DPO}}{\partial z^{(w)}} = \beta \sigma(-\Delta) \left( \hat{y}^{(w)} - e_{y_w} \right) $$

This is the same gradient form as cross-entropy, scaled by the sequence-level factor $\beta \sigma(-\Delta)$.

For the dispreferred sequence, the sign is inverted:

$$ \frac{\partial \mathcal{L}_\text{DPO}}{\partial z^{(l)}} = -\beta \sigma(-\Delta) \left( \hat{y}^{(l)} - e_{y_l} \right) $$

So DPO applies:

an SFT-like update to the preferred completion
an anti-SFT update to the dispreferred completion
the same sequence-level scaling factor to both
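The two gradient formulas can be sketched for a toy setup in which the logits at every position are treated as free parameters and the reference log-probabilities are constants. All names and numbers below are hypothetical; a real model would obtain the logits from a forward pass.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def seq_logprob(logits, tokens):
    """Sum of token log-probabilities; logits[i] holds the logits at position i."""
    return sum(math.log(softmax(z)[t]) for z, t in zip(logits, tokens))

def margin(logits_w, tokens_w, logits_l, tokens_l, ref_w, ref_l, beta):
    reward_w = beta * (seq_logprob(logits_w, tokens_w) - ref_w)
    reward_l = beta * (seq_logprob(logits_l, tokens_l) - ref_l)
    return reward_w - reward_l

def dpo_loss(logits_w, tokens_w, logits_l, tokens_l, ref_w, ref_l, beta):
    return -math.log(sigmoid(margin(logits_w, tokens_w, logits_l, tokens_l,
                                    ref_w, ref_l, beta)))

def scaled_ce_grad(logits, tokens, scale):
    """scale * (y_hat - e_y) at every position."""
    return [[scale * (p - (1.0 if k == t else 0.0))
             for k, p in enumerate(softmax(z))]
            for z, t in zip(logits, tokens)]

def dpo_logit_grads(logits_w, tokens_w, logits_l, tokens_l, ref_w, ref_l, beta):
    delta = margin(logits_w, tokens_w, logits_l, tokens_l, ref_w, ref_l, beta)
    scale = beta * sigmoid(-delta)  # shared sequence-level factor
    grad_w = scaled_ce_grad(logits_w, tokens_w, scale)   # SFT-like
    grad_l = scaled_ce_grad(logits_l, tokens_l, -scale)  # anti-SFT
    return grad_w, grad_l

# Hypothetical toy pair: two positions each, V = 3.
logits_w = [[0.2, 1.0, -0.3], [0.5, 0.1, 0.0]]
tokens_w = [1, 0]
logits_l = [[0.2, 1.0, -0.3], [-0.4, 0.3, 0.9]]
tokens_l = [1, 2]
ref_w, ref_l, beta = -1.5, -1.2, 0.5

grad_w, grad_l = dpo_logit_grads(logits_w, tokens_w, logits_l, tokens_l,
                                 ref_w, ref_l, beta)
```

For the preferred completion the entry at the target token is negative, as in SFT. For the dispreferred completion it is positive, so gradient descent lowers that logit.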

The scaling factor depends on the model's current relative preference:

If the model already strongly prefers $y_w$ over $y_l$, then $\Delta$ is large and positive, so $\sigma(-\Delta)$ is close to $0$. The update is small.
If the model is uncertain, then $\Delta \approx 0$, so $\sigma(-\Delta) \approx 0.5$. The update is moderate.
If the model prefers $y_l$ over $y_w$, then $\Delta$ is negative, so $\sigma(-\Delta)$ is close to $1$. The update is large.
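To get a feeling for the three regimes, the scaling factor can be evaluated for a few margins. This is a small sketch with arbitrarily chosen values of $\Delta$ and an assumed $\beta = 0.1$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_weight(delta, beta=0.1):
    """Sequence-level scaling factor beta * sigma(-delta)."""
    return beta * sigmoid(-delta)

# Model already right, uncertain, and wrong (hypothetical margins).
for delta in (5.0, 0.0, -5.0):
    print(f"delta={delta:+.1f}  "
          f"sigma(-delta)={sigmoid(-delta):.3f}  "
          f"weight={dpo_weight(delta):.4f}")
```

For these margins, $\sigma(-\Delta)$ is roughly $0.007$, $0.5$, and $0.993$, so a pair the model already ranks correctly contributes far less to the update than a pair it ranks incorrectly.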

This is the main conceptual difference from SFT. SFT always trains on the target completion with a fixed sequence weight. DPO trains harder when the model's relative preference is wrong.

Besides that, there is another useful consequence of the DPO formulation: if the preferred and dispreferred completions share a prefix, the DPO updates cancel on the identical part.

Suppose both completions start with the same tokens. For those positions, the context is the same, the model distribution is the same, and the target token is also the same. Therefore:

$$ \left(\hat{y} - e_y\right) - \left(\hat{y} - e_y\right) = 0 $$

So DPO does not update the model on the tokens of a prefix that both completions share.

At the first token where the completions diverge, the model distribution is still computed from the same context, but the target tokens differ. The distribution terms cancel, and only the one-hot difference remains.
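Written out for this position, with $\hat{y}$ denoting the shared distribution and $y_w$, $y_l$ the two differing target tokens, the preferred and dispreferred contributions to the logit gradient add up to:

$$ \beta \sigma(-\Delta) \left[ \left(\hat{y} - e_{y_w}\right) - \left(\hat{y} - e_{y_l}\right) \right] = \beta \sigma(-\Delta) \left( e_{y_l} - e_{y_w} \right) $$

Under gradient descent this raises the logit of the preferred token and lowers the logit of the dispreferred token by the same amount, while all other logits at this position receive no gradient.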

After the completions have diverged, the contexts are different, so the distributions no longer necessarily cancel. At that point, DPO applies an SFT-like update to the preferred continuation and an anti-SFT update to the dispreferred continuation.

Summary

To summarize, we saw that the gradients of cross-entropy and DPO have the same core part $\hat{y} - e_y$, but DPO has a crucial sequence-level weighting factor

$$ \beta \sigma(-\Delta) $$

that reduces the gradient for pairs where the model's preference already matches the expected one.


BibTeX

@misc{dpo-is-sft-2025,
  author = {Alexander Weers},
  title = {DPO is SFT},
  url = {https://aweers.de/blog/2025/dpo-is-sft/},
}

References

[1] Rafael Rafailov et al. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023.