14. DPO in Detail

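Direct Preference Optimization (DPO, Rafailov et al., NeurIPS 2023) uses a neat mathematical trick to collapse the two-stage RLHF pipeline (train an RM, then run PPO) into a single supervised loss. It is minimal to engineer (no critic, no rollouts, no separate RM), yet it approaches or even exceeds PPO on a number of benchmarks, and since 2023 it has become one of the de facto standards for open-source alignment.
This chapter has the densest math in the tutorial: we derive the DPO loss from the RLHF objective step by step, then discuss key variants such as IPO, cDPO, and RSO, and finish with implementation details and an analysis of failure modes.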

14.1 Starting Point: the KL-Constrained RLHF Objective

Recall the stage-3 RLHF objective from Chapter 12, §12.6:

max_{π_{θ}} E_{x \sim D, y \sim π_{θ} (\cdot | x)} [r (x, y)] - β E_{x} [KL (π_{θ} (\cdot | x) ∥ π_{ref} (\cdot | x))]

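PPO optimizes this objective numerically with policy gradients. DPO's core insight is that the objective has a closed-form optimum, and that this optimum can be rearranged into an implicit reward expressed through $π_{θ}$ , which bypasses both RM training and RL.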

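14.2 Full Derivation: from RLHF to the DPO Loss

We derive the loss rigorously in five steps.

Step 1: Write down the KL-constrained maximization problem

Fix a prompt $x$ and write the objective as a function of the policy $π (\cdot | x)$ :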

J [π] = E_{y \sim π (\cdot | x)} [r (x, y)] - β KL (π (\cdot | x) ∥ π_{ref} (\cdot | x))

Expand the KL term:

J [π] = \sum_{y} π (y | x) r (x, y) - β \sum_{y} π (y | x) \log \frac{π (y | x)}{π_{ref} (y | x)}

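Constraints: $\sum_{y} π (y | x) = 1$ and $π (y | x) \geq 0$ .

Note: this is a functional optimization (over the function $π$ ), but because $y$ is a discrete variable (a token sequence), it can be treated as an ordinary multivariate problem over the individual values $π (y | x)$ .

Step 2: Solve for the optimal policy $π^{*}$

Introduce a Lagrange multiplier $λ (x)$ for the normalization constraint (the non-negativity constraint is satisfied automatically by the final solution):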

L [π, λ] = \sum_{y} π (y | x) r (x, y) - β \sum_{y} π (y | x) \log \frac{π (y | x)}{π_{ref} (y | x)} - λ (x) (\sum_{y} π (y | x) - 1)

Take the partial derivative with respect to $π (y | x)$ and set it to zero:

\frac{\partial L}{\partial π (y | x)} = r (x, y) - β [\log \frac{π (y | x)}{π_{ref} (y | x)} + 1] - λ (x) = 0

Solving:

\log \frac{π (y | x)}{π_{ref} (y | x)} = \frac{r (x, y)}{β} - 1 - \frac{λ (x)}{β}

π (y | x) = π_{ref} (y | x) \cdot \exp (\frac{r (x, y)}{β}) \cdot \exp (- 1 - \frac{λ (x)}{β})

The last factor depends only on $x$ ; call it $\frac{1}{Z (x)}$ , where $Z (x)$ is fixed by the normalization condition:

\sum_{y} π (y | x) = 1 \Rightarrow Z (x) = \sum_{y} π_{ref} (y | x) \exp (\frac{r (x, y)}{β})

The optimal policy is therefore:

π^{*} (y | x) = \frac{1}{Z (x)} π_{ref} (y | x) \exp (\frac{r (x, y)}{β})

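Interpretation: a Boltzmann policy

This has exactly the form of the Boltzmann distribution from statistical physics:

$π_{ref}$ acts as the prior / reference distribution;
$- r (x, y)$ acts as the energy (high reward = low energy);
$Z (x)$ is the partition function;
$β$ is the temperature (small $β$ → low temperature → a distribution sharply peaked on high-reward responses; large $β$ → the policy stays close to $π_{ref}$ ).

In reinforcement learning this is known as the maximum-entropy / soft RL policy (Peters 2010, Levine 2018, Haarnoja 2018); algorithms such as SAC use the same form.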
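The closed form can be checked numerically on a toy problem. The sketch below assumes a response set of only three elements with made-up values for `pi_ref`, `rewards`, and `beta` (all hypothetical); it computes $π^{*}$ , confirms that the objective at $π^{*}$ equals $β \log Z (x)$ , and shows that a smaller $β$ concentrates the policy on the highest-reward response.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form solution pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """J[pi] = E_pi[r] - beta * KL(pi || pi_ref) on a discrete response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

pi_ref = [0.5, 0.3, 0.2]    # toy reference policy over three responses
rewards = [0.0, 1.0, 2.0]   # toy rewards
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
j_star = objective(pi_star, pi_ref, rewards, beta)   # equals beta * log Z
j_ref = objective(pi_ref, pi_ref, rewards, beta)     # KL term is zero here

print("pi*:", [round(p, 4) for p in pi_star])
print("J[pi*] - beta*log Z:", j_star - beta * math.log(z))
print("J[pi*] > J[pi_ref]:", j_star > j_ref)
```

Substituting $π^{*}$ into $J$ makes the reward terms cancel and leaves exactly $β \log Z (x)$ , which is why the second printed value is zero up to rounding.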

Step 3: Invert the policy to recover the reward

Take logs of both sides of the optimal-policy expression:

\log π^{*} (y | x) = \log π_{ref} (y | x) + \frac{r (x, y)}{β} - \log Z (x)

Rearranging gives the implicit reward:

r (x, y) = β \log \frac{π^{*} (y | x)}{π_{ref} (y | x)} + β \log Z (x)

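This is the key trick in DPO: the reward $r$ can be written as the log-ratio between the optimal policy and the reference policy, plus a term $β \log Z (x)$ that depends only on $x$ .

An important observation

$Z (x)$ looks intractable: it is a sum over every possible response $y$ , which is combinatorially large. The next step shows that it cancels.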

Step 4: Plug into the Bradley-Terry preference model

The Bradley-Terry model (Chapter 12, §12.4) states:

P (y_{w} ≻ y_{l} ∣ x) = σ (r (x, y_{w}) - r (x, y_{l}))

Substituting the expression for $r$ from Step 3:

r (x, y_{w}) - r (x, y_{l}) = β \log \frac{π^{*} (y_{w} | x)}{π_{ref} (y_{w} | x)} + β \log Z (x) - β \log \frac{π^{*} (y_{l} | x)}{π_{ref} (y_{l} | x)} - β \log Z (x)

= β \log \frac{π^{*} (y_{w} | x)}{π_{ref} (y_{w} | x)} - β \log \frac{π^{*} (y_{l} | x)}{π_{ref} (y_{l} | x)}

The key point: $Z (x)$ cancels between chosen and rejected. This is the "magic moment" of the DPO derivation.

Therefore:

P (y_{w} ≻ y_{l} | x) = σ (β \log \frac{π^{*} (y_{w} | x)}{π_{ref} (y_{w} | x)} - β \log \frac{π^{*} (y_{l} | x)}{π_{ref} (y_{l} | x)})
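The cancellation can be verified numerically. The sketch below assumes a toy setting with two responses `"a"` and `"b"` and made-up rewards (all values hypothetical): it builds $π^{*}$ from the true reward, then checks that the Bradley-Terry probability computed from the true rewards matches the one computed from policy log-ratios alone, even though each implicit reward differs from the true reward by $β \log Z (x)$ .

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
pi_ref = {"a": 0.6, "b": 0.4}     # toy reference policy
reward = {"a": 1.0, "b": -0.5}    # toy "true" reward

# Optimal policy pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)
pi_star = {y: pi_ref[y] * math.exp(reward[y] / beta) / z for y in pi_ref}

def implicit_reward(y):
    """beta * log(pi*(y) / pi_ref(y)); equals r(y) - beta * log Z."""
    return beta * math.log(pi_star[y] / pi_ref[y])

p_bt = sigmoid(reward["a"] - reward["b"])
p_from_policy = sigmoid(implicit_reward("a") - implicit_reward("b"))

print("offset r - r_hat:", reward["a"] - implicit_reward("a"), "vs beta*log Z:", beta * math.log(z))
print("BT from rewards:", p_bt, " BT from policy ratios:", p_from_policy)
```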

Step 5: The DPO loss

Replace $π^{*}$ with the trainable policy $π_{θ}$ and perform maximum likelihood estimation (MLE) on the preference data $D = \{(x, y_{w}, y_{l})\}$ :

L_{DPO} (θ) = - E_{(x, y_{w}, y_{l}) \sim D} [\log σ (β \log \frac{π_{θ} (y_{w} | x)}{π_{ref} (y_{w} | x)} - β \log \frac{π_{θ} (y_{l} | x)}{π_{ref} (y_{l} | x)})]

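This is the DPO loss.

Key properties

No RM to train: the reward is folded into the policy;
No RL sampling: training is fully offline and purely supervised;
Cheap gradients: each pair needs only forward passes of $π_{θ}$ and $π_{ref}$ on $y_{w}$ and $y_{l}$ , with a backward pass through $π_{θ}$ alone;
Objective = binary-classification log-likelihood: $(y_{w}, y_{l})$ play the roles of positive and negative example.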
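To make the loss concrete, here is a minimal scalar sketch for a single preference pair. It assumes the four sequence log-probabilities have already been computed; the function names and the numbers are illustrative only.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log 2.
loss_at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# After some training: chosen went up by 1 nat, rejected went down by 3 nats.
loss_after = dpo_pair_loss(-11.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

Only the two log-ratios enter the loss, which is why the reference log-probabilities can be cached ahead of training.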

14.3 Gradient Analysis: What DPO Is Doing

Let ${\hat{r}}_{θ} (x, y) = β \log \frac{π_{θ} (y | x)}{π_{ref} (y | x)}$ (the implicit reward) and $Δ = {\hat{r}}_{θ} (x, y_{w}) - {\hat{r}}_{θ} (x, y_{l})$ .

The DPO loss for one pair:

L_{DPO} = - \log σ (Δ) = \log (1 + e^{- Δ})

Differentiating with respect to the parameters $θ$ (the reference terms are constants):

\nabla_{θ} L_{DPO} = - σ (- Δ) \cdot \nabla_{θ} Δ = - σ (- Δ) \cdot β \cdot (\nabla_{θ} \log π_{θ} (y_{w} | x) - \nabla_{θ} \log π_{θ} (y_{l} | x))

Since $- Δ = {\hat{r}}_{θ} (x, y_{l}) - {\hat{r}}_{θ} (x, y_{w})$ , the weight can be written in terms of the implicit rewards:

\nabla_{θ} L_{DPO} = - β \cdot \underbrace{σ ({\hat{r}}_{θ} (x, y_{l}) - {\hat{r}}_{θ} (x, y_{w}))}_{\text{weight}} \cdot (\nabla_{θ} \log π_{θ} (y_{w} | x) - \nabla_{θ} \log π_{θ} (y_{l} | x))

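Interpretation

The weight $σ ({\hat{r}}_{l} - {\hat{r}}_{w})$ measures how wrong the model currently is: it is the probability the implicit reward assigns to the wrong ordering $y_{l} ≻ y_{w}$ ;
When the model already prefers $y_{w}$ correctly ( ${\hat{r}}_{w} ≫ {\hat{r}}_{l}$ ), the weight $\to 0$ and the gradient vanishes automatically;
When the model has the pair backwards, the weight $\to 1$ and the gradient is at its largest.

This dynamic example weighting is the essential difference between DPO and a naive likelihood-ratio loss. Its effect resembles the hard-example emphasis of focal loss.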
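The weighting can be checked with a few lines of arithmetic. The sketch below treats the loss as a function of the scalar margin $Δ$ only and compares the analytic derivative $- σ (- Δ)$ with a central finite difference; the three margins are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(delta):
    """L = -log sigmoid(delta) = log(1 + exp(-delta))."""
    return math.log1p(math.exp(-delta))

def grad_wrt_delta(delta):
    """Analytic dL/d(delta) = -sigmoid(-delta)."""
    return -sigmoid(-delta)

def numeric_grad(delta, eps=1e-6):
    """Central finite difference of the loss."""
    return (dpo_loss(delta + eps) - dpo_loss(delta - eps)) / (2 * eps)

# Wrong ordering, undecided, correct ordering
for delta in (-4.0, 0.0, 4.0):
    print(f"delta={delta:+.1f}  weight={sigmoid(-delta):.3f}  "
          f"analytic={grad_wrt_delta(delta):.6f}  numeric={numeric_grad(delta):.6f}")
```

A pair the model gets badly wrong ( $Δ = - 4$ ) receives a weight close to 1, while a pair it already ranks correctly ( $Δ = 4$ ) receives a weight close to 0.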

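Comparison with the SFT and PPO gradients

Algorithm	Gradient of the loss
SFT	$- \nabla_{θ} \log π_{θ} (y_{w} ∣ x)$ : pushes chosen up only
DPO	$- σ ({\hat{r}}_{l} - {\hat{r}}_{w}) \cdot β \cdot (\nabla_{θ} \log π_{θ} (y_{w}) - \nabla_{θ} \log π_{θ} (y_{l}))$ : pushes chosen up and rejected down at the same time
PPO	roughly $- {\hat{A}}_{t} \cdot \nabla_{θ} \log π_{θ} (y_{t} ∣ x, y_{< t})$ inside the clipping range, with $y$ sampled online from the policy: pushes up or down according to the advantage

DPO's strength: it uses the contrastive signal between chosen and rejected, which is stronger than the one-sided supervision of SFT. Its weakness: it is fully offline and cannot adapt to distribution shift.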

14.4 Implementation Notes

14.4.1 Data format

python

# Example DPO record
{
    "prompt": "Explain photosynthesis in one sentence.",
    "chosen": "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen.",
    "rejected": "Plants eat light to make food."
}

The chosen and rejected responses in a record must both answer the same prompt.

14.4.2 Computing log-probabilities

DPO needs $\log π_{θ} (y | x)$ and $\log π_{ref} (y | x)$ . For a token sequence:

\log π (y | x) = \sum_{t = 1}^{| y |} \log π (y_{t} | x, y_{< t})

Implementation:

python

import torch
import torch.nn.functional as F

def get_log_probs(model, input_ids, labels, attention_mask):
    """
    Return sum_t log p(y_t | x, y_<t) over response tokens, one value per sequence.
    labels: same shape as input_ids; -100 on prompt (and padding) positions,
            token ids on response positions.
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]   # [B, T-1, V]; position t predicts token t+1
    labels = labels[:, 1:].clone()       # [B, T-1]; drop the first label to align with logits

    # Mask: score response tokens only
    loss_mask = (labels != -100)
    labels[~loss_mask] = 0   # dummy index so gather does not fail on -100

    log_probs = F.log_softmax(logits, dim=-1)   # [B, T-1, V]
    selected = log_probs.gather(2, labels.unsqueeze(-1)).squeeze(-1)  # [B, T-1]

    # Sum over tokens to get log p(y|x) for each sequence
    sequence_logp = (selected * loss_mask.float()).sum(dim=-1)  # [B]
    return sequence_logp

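14.4.3 Computing both branches in one step

The simplest approach, used in the code here, runs the model on the chosen batch and on the rejected batch separately. Concatenating the two into a single batch of size 2B lets one forward pass cover both branches: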

python

def dpo_loss(model, ref_model, batch, β=0.1):
    """
    batch (right-padded tensors):
        chosen_input_ids / chosen_labels / chosen_attention_mask:       [B, T_c]
        rejected_input_ids / rejected_labels / rejected_attention_mask: [B, T_r]
    """
    # Policy: one forward pass for chosen and one for rejected (with gradients)
    logp_chosen = get_log_probs(model,
                                 batch["chosen_input_ids"],
                                 batch["chosen_labels"],
                                 batch["chosen_attention_mask"])    # [B]
    logp_rejected = get_log_probs(model,
                                   batch["rejected_input_ids"],
                                   batch["rejected_labels"],
                                   batch["rejected_attention_mask"])# [B]

    # Reference policy: forward under no_grad (can also be precomputed and cached)
    with torch.no_grad():
        ref_logp_chosen = get_log_probs(ref_model, ...)
        ref_logp_rejected = get_log_probs(ref_model, ...)

    # Log-ratios; multiplied by β they are the implicit rewards
    π_logratio_w = logp_chosen - ref_logp_chosen
    π_logratio_l = logp_rejected - ref_logp_rejected

    # DPO loss
    logits = β * (π_logratio_w - π_logratio_l)
    loss = -F.logsigmoid(logits).mean()

    # Monitoring metrics
    chosen_rewards = β * π_logratio_w.detach()
    rejected_rewards = β * π_logratio_l.detach()
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    margin = (chosen_rewards - rejected_rewards).mean()

    return loss, {
        "acc": accuracy,
        "margin": margin,
        "chosen_reward": chosen_rewards.mean(),
        "rejected_reward": rejected_rewards.mean(),
    }
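The concatenation idea can be sketched without any framework. The helpers below (`concat_pairs` and `split_logps` are hypothetical names, not part of any library) right-pad chosen and rejected token-id lists to a common length, stack them into one batch of 2B rows, and later split the per-sequence outputs back into the two branches. In a real pipeline the lists would be tensors and the labels would be padded with -100 in the same way.

```python
def concat_pairs(chosen, rejected, pad_id=0):
    """Right-pad both branches to a common length and stack them:
    rows 0..B-1 are chosen, rows B..2B-1 are rejected."""
    assert len(chosen) == len(rejected), "one rejected response per chosen response"
    rows = chosen + rejected
    max_len = max(len(r) for r in rows)
    input_ids = [r + [pad_id] * (max_len - len(r)) for r in rows]
    attention_mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in rows]
    return input_ids, attention_mask

def split_logps(sequence_logps):
    """Undo the stacking on the per-sequence outputs of the single forward pass."""
    half = len(sequence_logps) // 2
    return sequence_logps[:half], sequence_logps[half:]

# Toy token ids; each pair shares its prompt prefix.
chosen = [[5, 6, 7, 8], [5, 9]]
rejected = [[5, 6, 1], [5, 9, 2, 3, 4]]

input_ids, attention_mask = concat_pairs(chosen, rejected)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The trade-off is that every row is padded to the longest sequence across both branches, so one forward pass is saved at the cost of some wasted compute on padding.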

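14.4.4 Handling the reference model

Option A: keep it resident on the GPU (simplest)

Load $π_{ref}$ onto the GPU and run it at every step;
Memory for model weights doubles.

Option B: precompute and cache (recommended)

Before training, compute all $(\log π_{ref} (y_{w} | x), \log π_{ref} (y_{l} | x))$ once;
During training only the cache is read, so $π_{ref}$ takes no GPU memory;
Especially convenient for LoRA DPO.

Option C: LoRA + shared backbone

Train a LoRA adapter with base = $π_{ref}$ ;
Enable LoRA when computing $π_{θ}$ , disable it for $π_{ref}$ ;
Only one backbone is kept in memory, which is the most memory-efficient option.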

python

# LoRA + shared backbone (PEFT style); `model` is a PeftModel wrapping the base
from peft import PeftModel

# Step 1: reference log-probs
with model.disable_adapter():   # LoRA off → equivalent to the reference model
    with torch.no_grad():
        ref_logp_chosen = get_log_probs(model, ...)
        ref_logp_rejected = get_log_probs(model, ...)

# Step 2: policy log-probs
logp_chosen = get_log_probs(model, ...)   # LoRA is enabled by default
logp_rejected = get_log_probs(model, ...)

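14.4.5 Key hyperparameters

Hyperparameter	Typical range	Notes
$β$ (KL strength)	0.01 ~ 0.5	Smaller is more aggressive; Llama 3 reports 0.1
Learning rate	1e-7 ~ 5e-6	Roughly 10× smaller than SFT; DPO overfits easily
Batch size	32 ~ 128 (pairs)
Epochs	1 ~ 3	More tends to overfit
Warmup ratio	0.1	Linear
Optimizer	AdamW + cosine schedule
Sequence length	≤ 4K	Long sequences easily run out of memory
Max prompt length	$\leq$ T/2	Keeps the prompt from crowding out the response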

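14.5 DPO Pitfalls: Failure Modes

DPO looks simple but has a number of pitfalls in practice.

14.5.1 Likelihood Displacement

The most common failure: the log-probs of both chosen and rejected are pushed down, with chosen merely falling more slowly.

Why? Recall the gradient: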

\nabla_{θ} L \propto - (\nabla_{θ} \log π_{θ} (y_{w} | x) - \nabla_{θ} \log π_{θ} (y_{l} | x))

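DPO only cares about the relative margin $\log π (y_{w}) - \log π (y_{l})$ (measured against the reference); it does not guarantee that $π_{θ} (y_{w} | x)$ itself stays high. If training drives the log-prob of $y_{l}$ very low, the loss still decreases even when $y_{w}$ drifts down along with it.

Consequences:

Generation quality drops (the probability of the chosen responses has shrunk too);
Probability mass may move to out-of-distribution tokens.

Mitigations:

Penalize a falling chosen log-ratio (DPOP / DPO-Positive, Pal et al. 2024). In a simplified additive form: $L_{DPOP} = L_{DPO} + λ \cdot \max (0, \log π_{ref} (y_{w} | x) - \log π_{θ} (y_{w} | x))$ , which discourages $π_{θ} (y_{w})$ from falling below $π_{ref} (y_{w})$ ; the original paper places this penalty inside the sigmoid rather than adding it outside.
Or add an SFT (NLL) term on the chosen response to the loss, as ORPO does.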
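A small numeric sketch shows why the plain loss cannot see likelihood displacement. It uses made-up sequence log-probs and the simplified additive penalty written above (not necessarily the exact published DPOP objective); `lam` and all numbers are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return math.log1p(math.exp(-margin))

def chosen_penalty(logp_w, ref_w):
    """Hinge term max(0, log pi_ref(y_w|x) - log pi_theta(y_w|x))."""
    return max(0.0, ref_w - logp_w)

ref_w, ref_l = -20.0, -22.0
before = (ref_w, ref_l)     # policy starts at the reference
after = (-25.0, -40.0)      # both log-probs fell; rejected fell faster

plain_before = dpo_loss(*before, ref_w, ref_l)
plain_after = dpo_loss(*after, ref_w, ref_l)

lam = 0.5
penalized_after = plain_after + lam * chosen_penalty(after[0], ref_w)

print("plain loss before/after:", plain_before, plain_after)
print("penalized loss after:   ", penalized_after)
```

The plain loss goes down even though the chosen response lost 5 nats of log-probability; with the hinge term the same state scores worse than the starting point, so the optimizer is steered away from it.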

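14.5.2 Verbosity Bias

DPO's implicit reward is $β \sum_{t} \log \frac{π_{θ} (y_{t} | x, y_{< t})}{π_{ref} (y_{t} | x, y_{< t})}$ , a sum over tokens, so its magnitude scales with sequence length: longer responses accumulate more log-ratio. If chosen responses in the training data are longer on average than rejected ones, the model learns that "longer is better".

Empirically, DPO-trained models often produce clearly longer responses than their SFT starting point; increases of 30-60% in average length are commonly quoted, though the exact figure depends on the data and setup.

Mitigations:

SimPO: uses the average log-prob (divided by $| y |$ ) as the reward, i.e. length normalization;
Data preprocessing: match the lengths of chosen and rejected;
Length-controlled metric: evaluate with LC win rate (AlpacaEval 2).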
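The length effect is easy to see with toy numbers. The sketch below assumes two responses whose per-token log-ratios are identical and compares a summed reward with a length-averaged one. It only illustrates the normalization idea; SimPO itself averages the policy log-prob and uses no reference model. Function names and values are hypothetical.

```python
def summed_reward(token_log_ratios, beta=0.1):
    """DPO-style implicit reward: beta * sum_t log(pi_theta / pi_ref)."""
    return beta * sum(token_log_ratios)

def mean_reward(token_log_ratios, beta=0.1):
    """Length-normalized variant: beta * average per-token log-ratio."""
    return beta * sum(token_log_ratios) / len(token_log_ratios)

short = [0.2] * 10   # 10 tokens, each slightly up-weighted relative to the reference
long = [0.2] * 50    # 50 tokens with the same per-token log-ratio

print("summed:", summed_reward(short), summed_reward(long))
print("mean:  ", mean_reward(short), mean_reward(long))
```

With the summed form the longer response gets five times the reward for the same per-token quality; after averaging, the two are tied.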

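14.5.3 Noisy preference data

Human annotations are inconsistent (inter-annotator agreement is typically 70-85%). DPO assumes labels are 100% correct and becomes overconfident when they are noisy. Symptoms:

${\hat{r}}_{w} - {\hat{r}}_{l}$ keeps growing and the KL to the reference blows up;
The actual win rate goes down.

Mitigation: cDPO (§14.6.1), which amounts to label smoothing.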

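14.5.4 OOD implicit reward

DPO's ${\hat{r}}_{θ} (x, y) = β \log π_{θ} / π_{ref}$ is an implicit RM, but it comes with no guarantees outside the training distribution. Rafailov et al. (NeurIPS 2024), "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms", show that direct alignment algorithms (DAAs) such as DPO, IPO, and SLiC also suffer from over-optimization.

Typical symptoms:

The gold metric starts to fall early in training, often within the first epoch;
The implicit reward becomes absurdly high on OOD inputs;
Training accuracy keeps rising while generation quality drops.

Mitigations:

Early stopping while monitoring a gold metric;
Online / Iterative DPO;
Increase $β$ (a stronger KL constraint) or add explicit regularization.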

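14.6 Key DPO Variants

14.6.1 cDPO (Conservative DPO with Label Noise)

Mitchell (2023) assumes each preference label is flipped with probability $ε$ (i.e. the recorded $y_{w}$ is really the $y_{l}$ ). Writing $Δ = β Δ_{θ}$ for the implicit reward margin, the probability of observing the recorded ordering under this noise model is: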

P (y_{w} ≻ y_{l} | x) = (1 - ε) σ (Δ) + ε σ (- Δ)

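cDPO trains against the smoothed target instead of a hard label: it minimizes the cross-entropy between the target $(1 - ε, ε)$ and the model's prediction $σ (β Δ_{θ})$ :

L_{cDPO} = - (1 - ε) \log σ (β Δ_{θ}) - ε \log σ (- β Δ_{θ})

where $Δ_{θ} = \log π_{θ} (y_{w} | x) / π_{ref} (y_{w} | x) - \log π_{θ} (y_{l} | x) / π_{ref} (y_{l} | x)$ .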

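Key property: the gradient reaches zero

Gradient of vanilla DPO, with $Δ = β Δ_{θ}$ : $- β \cdot σ (- Δ) \cdot (\nabla \log π_{θ} (y_{w}) - \nabla \log π_{θ} (y_{l}))$

Gradient of cDPO (after simplification): $- β \cdot (σ (- Δ) (1 - ε) - σ (Δ) ε) \cdot (\nabla \log π_{θ} (y_{w}) - \nabla \log π_{θ} (y_{l}))$

When $σ (Δ) = 1 - ε$ (the model's confidence has reached exactly $1 - ε$ ):

σ (- Δ) (1 - ε) - σ (Δ) ε = ε (1 - ε) - (1 - ε) ε = 0

The gradient is zero, which prevents overconfidence. In vanilla DPO the weight $σ (- Δ)$ is strictly positive for every finite $Δ$ , so the margin keeps being pushed up for as long as training continues.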
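The zero-gradient point can be located numerically. The sketch below evaluates the scalar weights from the two gradient expressions above at the margin where $σ (Δ) = 1 - ε$ ; `eps = 0.1` is an arbitrary illustrative noise rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_weight(delta):
    """Scalar factor in the vanilla DPO gradient."""
    return sigmoid(-delta)

def cdpo_weight(delta, eps):
    """Scalar factor in the cDPO gradient."""
    return (1 - eps) * sigmoid(-delta) - eps * sigmoid(delta)

eps = 0.1
delta_star = math.log((1 - eps) / eps)   # margin at which sigmoid(delta) = 1 - eps

print("sigmoid(delta*):", sigmoid(delta_star))
print("DPO weight at delta*: ", dpo_weight(delta_star))
print("cDPO weight at delta*:", cdpo_weight(delta_star, eps))
print("cDPO weight beyond delta*:", cdpo_weight(delta_star + 1.0, eps))
```

Beyond that margin the cDPO weight turns negative, so the update pulls the margin back instead of pushing it further.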

Implementation

python

def cdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, β=0.1, label_smoothing=0.1):
    Δ = β * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = -(1 - label_smoothing) * F.logsigmoid(Δ) \
           -      label_smoothing  * F.logsigmoid(-Δ)
    return loss.mean()

In TRL this corresponds to loss_type="sigmoid" with label_smoothing=0.1 (arguments of DPOTrainer in older releases and of DPOConfig in newer ones).

14.6.2 IPO (Identity Preference Optimization)

Azar et al. (2023, DeepMind), in "A General Theoretical Paradigm to Understand Learning from Human Preferences", propose the $Ψ$ PO framework, which maximizes:

L_{Ψ} (π) = E_{x, y \sim π, y^{'} \sim μ} [Ψ (p^{*} (y ≻ y^{'} | x))] - τ KL (π ∥ π_{ref})

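$Ψ (p) = \log \frac{p}{1 - p}$ (logit): recovers RLHF/DPO;
$Ψ (p) = p$ (identity): gives IPO.

A practical form of IPO

After some derivation (see the appendix of the original paper), and writing the regularization strength $τ$ as $β$ , the IPO loss is equivalent to: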

L_{IPO} (θ) = E_{(x, y_{w}, y_{l})} [(h_{θ}^{y_{w}, y_{l}} (x) - \frac{1}{2 β})^{2}]

where:

h_{θ}^{y_{w}, y_{l}} (x) = \log \frac{π_{θ} (y_{w} | x)}{π_{ref} (y_{w} | x)} - \log \frac{π_{θ} (y_{l} | x)}{π_{ref} (y_{l} | x)}

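Key advantage: avoiding DPO's overfitting

When $y_{w}$ always wins in the data, DPO drives $π_{θ} (y_{l}) \to 0$ and the implicit reward margin to $+ \infty$ , ignoring $π_{ref}$ entirely no matter how large $β$ is. The reason is that the DPO loss $- \log σ (Δ)$ can be decreased without bound (the loss $\to 0$ as $Δ \to \infty$ ).

IPO anchors the log-ratio margin $h$ to the fixed value $1 / (2 β)$ ; the squared loss penalizes margins that exceed it as well as those that fall short. This avoids overconfidence.

In practice IPO is reported to help most when preferences are nearly deterministic (e.g. synthetic data), although results vary by task.

Implementation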

python

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, β=0.1):
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2 * β)
    loss = (h - target).pow(2).mean()
    return loss

TRL supports this directly via loss_type="ipo".
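The difference in behavior shows up when the log-ratio margin $h$ is allowed to grow. The sketch below evaluates both losses as scalar functions of $h$ at a few arbitrary margins, with an illustrative `beta = 0.1`.

```python
import math

def dpo_loss(h, beta):
    """-log sigmoid(beta * h): keeps decreasing as h grows."""
    return math.log1p(math.exp(-beta * h))

def ipo_loss(h, beta):
    """(h - 1/(2*beta))^2: minimized at h = 1/(2*beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
target = 1.0 / (2.0 * beta)
margins = [0.0, target, 20.0, 100.0]

dpo_values = [dpo_loss(h, beta) for h in margins]
ipo_values = [ipo_loss(h, beta) for h in margins]

for h, d, i in zip(margins, dpo_values, ipo_values):
    print(f"h={h:6.1f}  DPO={d:.6f}  IPO={i:.2f}")
```

The DPO column keeps shrinking toward zero, so the optimizer is always rewarded for a larger margin; the IPO column has its minimum at the target and grows again beyond it.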

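14.6.3 RSO (Statistical Rejection Sampling Optimization)

Liu et al. (2023). The problem: DPO/IPO train on preference data sampled from an arbitrary behavior policy $μ$ , whereas the theory calls for samples from the optimal policy $π^{*} \propto π_{ref} \exp (r / β)$ , whose distribution differs from $μ$ .

The RSO pipeline:

Train a BT reward model $r_{ϕ}$ ;
Generate multiple candidates $y$ from $π_{ref}$ ;
Use rejection sampling to draw approximate samples from $π^{*}$ : accept a candidate with probability $\exp ((r_{ϕ} (x, y) - r_{max}) / β)$ , where $r_{max}$ is the largest reward among the candidates (that is, $\exp (r_{ϕ} / β)$ divided by its upper bound $M = \exp (r_{max} / β)$ );
Label pairs of accepted samples with $r_{ϕ}$ and run DPO/IPO on them.

RSO brings the training data closer to $π^{*}$ , so the policy DPO converges to is closer to the true optimum. The cost is that an RM must be trained first, which brings back the two-stage RLHF pipeline.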
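The rejection-sampling step can be sketched as follows. This is a simplified illustration, not the paper's full algorithm: candidates are assumed to be samples from $π_{ref}$ , their rewards are drawn from a standard normal purely for demonstration, and the acceptance probability is $\exp ((r - r_{max}) / β)$ with $r_{max}$ taken over the candidate pool. The function name is hypothetical.

```python
import math
import random

def rejection_sample(candidates, rewards, beta, rng):
    """Keep candidate i with probability exp((r_i - r_max) / beta)."""
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        if rng.random() < math.exp((r - r_max) / beta):
            accepted.append((y, r))
    return accepted

rng = random.Random(0)
candidates = [f"y{i}" for i in range(1000)]          # stand-ins for sampled responses
rewards = [rng.gauss(0.0, 1.0) for _ in candidates]  # stand-in reward-model scores

kept = rejection_sample(candidates, rewards, beta=0.5, rng=rng)
mean_all = sum(rewards) / len(rewards)
mean_kept = sum(r for _, r in kept) / len(kept)

print("accepted:", len(kept), "of", len(candidates))
print("mean reward, all vs accepted:", mean_all, mean_kept)
```

The best candidate is always accepted (its acceptance probability is 1), and a smaller `beta` makes the filter stricter, which mirrors the sharper $π^{*}$ at low temperature.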

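14.6.4 β-DPO (adaptive β)

Wu et al. (2024) adjust $β$ dynamically instead of fixing it, calibrating it to the quality of the data (in the paper, via the implicit reward margin of each batch). A simpler per-sample heuristic in the same spirit, aimed at the larger gradients of long samples, is: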

β_{i} = β_{0} \cdot \frac{1}{| y_{w}^{(i)} | + | y_{l}^{(i)} |}

Feedback rules based on the measured KL are another option.

14.6.5 Robust DPO

Chowdhury et al. (2024). One way to state the idea is to give each preference pair a weight $w_{i}$ that down-weights inconsistent samples:

L_{Robust} = - \sum_{i} w_{i} \log σ (β Δ_{i})

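The weights can come from confidence estimates or from optimizing dual variables. The rDPO loss of Chowdhury et al. takes a related route: assuming a known flip rate $ε$ , it debiases the DPO loss as $((1 - ε) \ell (y_{w}, y_{l}) - ε \ell (y_{l}, y_{w})) / (1 - 2 ε)$ , where $\ell$ is the per-pair DPO loss.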

14.6.6 SLiC (Sequence Likelihood Calibration)

Zhao et al. (2023) propose a hinge-loss form:

L_{SLiC} = max (0, δ - \log π_{θ} (y_{w} | x) + \log π_{θ} (y_{l} | x)) + λ L_{SFT}

This is similar to DPO, but with a hinge in place of the sigmoid and with no reference model in the ranking term.

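14.7 Online DPO / Iterative DPO

14.7.1 Limits of offline DPO

DPO is an offline method: it trains on a fixed dataset $D$ . Once $π_{θ}$ moves away from the distribution that generated $D$ (which is bound to happen after a few epochs), the gradient signal becomes unreliable. This is the distribution-shift problem familiar from offline RL, and it is what PPO's online sampling avoids.

14.7.2 Iterative DPO

for iter = 1..T:
    1. Generate K responses on fresh prompts with the current π_θ
    2. Rank them with an external RM (or an LLM-as-judge) to form preference pairs
    3. Run 1-2 epochs of DPO on the new pairs

Each round amounts to "resample, then realign", moving step by step toward the online behavior of PPO. Llama 3 reports several rounds of this kind, and Tülu 3 likewise builds preference data from on-policy generations.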
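Step 2 of the loop, turning K scored responses into preference data, can be sketched as below. Pairing the best-scored response with the worst-scored one is one common choice and is an assumption here, as are the function name, the `min_gap` filter, and the toy scores; the output uses the record format from §14.4.1.

```python
def build_preference_pairs(prompt, responses, scores, min_gap=0.0):
    """Pair the best-scored response with the worst-scored one.
    Returns an empty list when the score gap is too small to be informative."""
    ranked = sorted(zip(scores, responses), key=lambda t: t[0], reverse=True)
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    if best_score - worst_score <= min_gap:
        return []
    return [{"prompt": prompt, "chosen": best, "rejected": worst}]

pairs = build_preference_pairs(
    "Explain photosynthesis in one sentence.",
    ["resp A", "resp B", "resp C"],   # K = 3 sampled responses (placeholders)
    [0.2, 0.9, -0.4],                 # scores from an RM or judge (made up)
)
print(pairs)
```

Dropping prompts whose responses all score about the same keeps near-ties, which carry mostly label noise, out of the next DPO round.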

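14.7.3 Self-Rewarding LM

Yuan et al. (2024) go a step further and internalize the RM as well: the model acts as its own judge, scoring its own generations with an LLM-as-a-judge prompt. The paper reports continued improvement over three iterations (M1 → M2 → M3).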

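14.7.4 Online DPO

New data is generated at every step:

for step = 1..N:
    # rollouts as in PPO, but trained with the DPO loss
    sample x, then y1, y2 ~ π_θ(· | x)
    label the preference with an RM or a rule
    take a DPO step on the freshly labeled pair(s)

This keeps the training data on-policy, as PPO does, while keeping the DPO loss and dropping the critic.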

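14.7.5 OAIF (Online AI Feedback)

Guo et al. (2024). A fixed, strong LLM labels preferences online (in place of an RM), and the policy is trained with online DPO. The authors report gains over offline DPO on several benchmarks.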

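14.8 DPO vs PPO: A Full Comparison

Dimension	PPO	DPO
Mathematical form	Policy gradient + importance sampling	Closed-form optimal policy + supervised BT learning
Number of models	4 (actor, critic, ref, RM)	2 (actor, ref)
Needs an RM	Yes	No (implicit)
Needs rollouts	Yes	No
Data usage	Online	Offline (can be made online by iterating)
GPU memory	High	Medium
Tuning complexity	High (10+ hyperparameters)	Low (mainly β and lr)
Training stability	Needs many tricks	Relatively stable, but prone to likelihood displacement
Data efficiency	Several updates per rollout	One gradient per preference pair per epoch
Distribution shift	Adapts online	Overfits easily offline
Reported use	InstructGPT (OpenAI), Anthropic's published RLHF work, Llama 2	Mainstream in open source (Mistral, Tülu, Zephyr, Llama 3)
Performance	Higher ceiling	Strong median performance

When to choose PPO

A large preference dataset and a good RM are already available;
The GPU memory budget is generous;
The team has RLHF tuning experience;
Peak performance is the goal.

When to choose DPO

Resources are limited (small companies, academic research);
The preference dataset is mid-sized (10K - 100K pairs);
Fast iteration matters;
Lightweight alignment together with LoRA.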

14.9 Full Implementation Example

Below is a compact DPO training loop (the full code is in code/07_dpo_training.py).

python

import json
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

class DPODataset(Dataset):
    def __init__(self, jsonl_path, tokenizer, max_len=2048):
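        # One JSON record per line; close the file after reading
        with open(jsonl_path, encoding="utf-8") as f:
            self.data = [json.loads(line) for line in f if line.strip()]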
        self.tok = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ex = self.data[idx]
        prompt = ex["prompt"]
        chosen = ex["chosen"]
        rejected = ex["rejected"]

        # Concatenate prompt + response
        chosen_full = self.tok(prompt + chosen, truncation=True,
                                max_length=self.max_len, return_tensors="pt")
        rejected_full = self.tok(prompt + rejected, truncation=True,
                                  max_length=self.max_len, return_tensors="pt")

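        # Prompt length in tokens, used for masking (assumes the prompt tokenizes
        # to the same ids on its own as it does as a prefix of prompt + response)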
        prompt_len = len(self.tok(prompt)["input_ids"])

        return {
            "chosen_input_ids": chosen_full.input_ids[0],
            "chosen_attention_mask": chosen_full.attention_mask[0],
            "chosen_prompt_len": prompt_len,
            "rejected_input_ids": rejected_full.input_ids[0],
            "rejected_attention_mask": rejected_full.attention_mask[0],
            "rejected_prompt_len": prompt_len,
        }


def get_log_probs(model, input_ids, attention_mask, prompt_lens):
    """Sum of log p(y_t | x, y_<t) over response tokens, per sequence (assumes right padding)."""
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]    # [B, T-1, V]
    targets = input_ids[:, 1:]            # [B, T-1]

    # Loss mask: 1 on response tokens, 0 on prompt and padding
    B, T = input_ids.shape
    positions = torch.arange(T-1, device=input_ids.device).unsqueeze(0)  # [1, T-1]
    response_mask = (positions >= (prompt_lens - 1).unsqueeze(1)) \
                  & (attention_mask[:, 1:] > 0)

    log_probs = F.log_softmax(logits, dim=-1)
    selected = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # [B, T-1]

    sequence_logp = (selected * response_mask.float()).sum(dim=-1)     # [B]
    return sequence_logp


def dpo_step(model, ref_model, batch, β=0.1, loss_type="sigmoid",
             label_smoothing=0.0):
    # Policy log probs
    logp_w = get_log_probs(model,
                           batch["chosen_input_ids"],
                           batch["chosen_attention_mask"],
                           batch["chosen_prompt_len"])
    logp_l = get_log_probs(model,
                           batch["rejected_input_ids"],
                           batch["rejected_attention_mask"],
                           batch["rejected_prompt_len"])

    # Reference log probs (no_grad)
    with torch.no_grad():
        ref_logp_w = get_log_probs(ref_model,
                                    batch["chosen_input_ids"],
                                    batch["chosen_attention_mask"],
                                    batch["chosen_prompt_len"])
        ref_logp_l = get_log_probs(ref_model,
                                    batch["rejected_input_ids"],
                                    batch["rejected_attention_mask"],
                                    batch["rejected_prompt_len"])

    Δ_θ = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

    if loss_type == "sigmoid":   # vanilla DPO or cDPO
        if label_smoothing > 0:
            loss = -(1 - label_smoothing) * F.logsigmoid(β * Δ_θ) \
                   -      label_smoothing  * F.logsigmoid(-β * Δ_θ)
        else:
            loss = -F.logsigmoid(β * Δ_θ)
    elif loss_type == "ipo":
        target = 1.0 / (2 * β)
        loss = (Δ_θ - target).pow(2)
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")

    metrics = {
        "loss": loss.mean().item(),
        "rewards/chosen": (β * (logp_w - ref_logp_w)).detach().mean().item(),
        "rewards/rejected": (β * (logp_l - ref_logp_l)).detach().mean().item(),
        "rewards/margin": (β * Δ_θ).detach().mean().item(),
        "rewards/accuracy": (Δ_θ > 0).float().mean().item(),
    }

    return loss.mean(), metrics


def train_dpo(model, ref_model, loader, optimizer, num_epochs=1, β=0.1):
    model.train()
    ref_model.eval()
    for epoch in range(num_epochs):
        for step, batch in enumerate(loader):
            batch = {k: v.cuda() for k, v in batch.items()}
            loss, metrics = dpo_step(model, ref_model, batch, β=β)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            if step % 10 == 0:
                print(f"epoch {epoch} step {step}: {metrics}")

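A complete version would add:

A padding collate function for the DataLoader (the dataset above returns variable-length tensors);
DDP/FSDP multi-GPU training;
Cached reference log-probs;
LoRA support;
TensorBoard / wandb logging;
Validation-set evaluation (pairwise accuracy, KL estimate).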

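14.10 Monitoring and Diagnostics

Recommended metrics to watch during DPO training:

Metric	Healthy range	Diagnosis when abnormal
`rewards/accuracy`	Rises steadily to 70-90%	Flat → β or lr is off; jumps to 100% almost immediately → overfitting
`rewards/margin`	Rises slowly to 1-5	Shoots above 10 → runaway margin growth (over-optimization)
`rewards/chosen`	Near 0 (the reference line)	Persistently negative → chosen log-prob is being pushed down (bad)
`rewards/rejected`	Clearly negative	Extremely negative → the model has "given up" on rejected
Drift proxy, e.g. $| {\hat{r}}_{w} + {\hat{r}}_{l} | / (2 β)$ (a true KL estimate needs samples from $π_{θ}$ )	Grows slowly	Spikes → stop early
Validation pairwise acc	Rises in step with training	Training rises, validation does not → overfitting

It is especially worth generating a few samples every N steps for manual inspection: many problems (repetition, empty content, odd style) only show up in generations.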
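The table can be turned into a simple automated check. The sketch below is a simplified illustration: it inspects a single logged step using the metric keys from the training code in §14.9, whereas the table speaks of trends over time; the function name and the thresholds `margin_limit` and `acc_limit` are assumptions loosely taken from the table.

```python
def diagnose(metrics, margin_limit=10.0, acc_limit=0.99):
    """Return a list of warning strings for one logged training step."""
    warnings = []
    if metrics["rewards/margin"] > margin_limit:
        warnings.append("margin too large: possible over-optimization")
    if metrics["rewards/chosen"] < 0:
        warnings.append("chosen reward negative: chosen log-prob below reference")
    if metrics["rewards/accuracy"] >= acc_limit:
        warnings.append("accuracy near 100%: check for overfitting")
    return warnings

healthy = {"rewards/accuracy": 0.78, "rewards/margin": 2.1,
           "rewards/chosen": 0.05, "rewards/rejected": -2.0}
suspicious = {"rewards/accuracy": 1.0, "rewards/margin": 14.0,
              "rewards/chosen": -3.0, "rewards/rejected": -17.0}

print(diagnose(healthy))
print(diagnose(suspicious))
```

A check like this only catches numeric symptoms; it does not replace reading sampled generations.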

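Chapter Summary

DPO collapses RLHF into a single supervised stage via the closed-form optimal policy plus a BT substitution;
Key math: (1) the Boltzmann form of $π^{*}$ ; (2) inverting $π$ for $r$ , with $Z (x)$ cancelling between chosen and rejected; (3) BT-MLE gives the final loss;
Simple to implement: only actor + reference, two models;
But it has pitfalls: likelihood displacement, verbosity bias, OOD over-optimization;
Main variants: cDPO (label noise), IPO (avoids overfitting), RSO (better sampling), Online/Iterative DPO (mitigates distribution shift);
vs PPO: friendlier to engineer, with a ceiling that is often reported to be somewhat lower; the gap to PPO with a large RM may widen as RM quality improves.

Exercises

Derivation check: in §14.2 Step 4 we said that $β \log Z (x)$ cancels between chosen and rejected. Verify: if the BT model is replaced by an "asymmetric" form, e.g. $P (y_{w} ≻ y_{l} | x) = σ (r (x, y_{w}) + c (x) - r (x, y_{l}))$ (the preference probability additionally depends on some $c (x)$ ), does $Z (x)$ still cancel? What does this tell us?
Compare the gradients of DPO and SFT-only: having only chosen and no rejected data (i.e. SFT) corresponds to setting $\log π_{θ} (y_{l} | x) = \log π_{ref} (y_{l} | x)$ in DPO. What form does the DPO gradient reduce to in this case? Why is it still not equivalent to pure SFT?
Engineering: while training DPO you find that rewards/accuracy reaches 95% by the end of the first epoch, but manual inspection shows clearly worse generations (verbose, full of boilerplate phrases). Propose a diagnosis-and-fix procedure with at least 3 concrete, actionable steps (for each, say what to do and why).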
