Denoising Diffusion Probabilistic Models -- Mathematical Derivation of Diffusion Models (2)

Denoising Diffusion Probabilistic Models -- Mathematical Derivation of Diffusion Models (1)

III. The Learning Target

Let us first recall the formulas for the forward (diffusion) process and the reverse (generative) process derived earlier:

  • Forward (diffusion) process: \(q(X_t|X_0) = \mathcal{N}\left(X_t; \sqrt{\bar{\alpha}_t}X_0, \left(1-\bar{\alpha}_t\right)I\right)\)
  • Reverse (generative) process: \(q\left(X_{t-1}|X_t,X_0\right) = \mathcal{N}\left(X_{t-1};\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_t\right),\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right)\)

For the forward diffusion process, all quantities are fixed, known parameters, so no neural network needs to be trained.
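To make this concrete, here is a minimal sketch of the forward process in PyTorch, assuming the linear \(\beta\) schedule from the DDPM paper (the value of `T` and the schedule endpoints are illustrative choices, not prescribed by the derivation above):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (illustrative values)
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample X_t ~ q(X_t | X_0) = N(sqrt(abar_t) X_0, (1 - abar_t) I) in one shot."""
    eps = torch.randn_like(x0)                 # epsilon ~ N(0, I)
    abar = alpha_bars[t]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)                 # a dummy batch standing in for real images
xt = q_sample(x0, t=500)                       # heavily noised sample; no training involved
```

Because \(q(X_t|X_0)\) has the closed form above, any \(X_t\) can be drawn in a single step, without iterating through \(X_1, \dots, X_{t-1}\).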

For the reverse process, the variance \(\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)\) is a known constant, but the \(\epsilon_t\) in the mean is exactly the Gaussian noise added during the forward process. In the reverse process we use a neural network to predict this noise and remove it from \(X_t\). Hence \(\epsilon_t\) is the only quantity that must be learned, and we denote the network's parameters by \(\theta\).

Therefore, replacing \(\epsilon_t\) in the mean with the network prediction \(\epsilon_\theta(X_t,t)\), we define:

\[\begin{aligned} p_{\theta}(X_{t-1}|X_t) &= \mathcal{N}\left(X_{t-1};\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(X_t,t)\right),\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right) \\ &\approx q(X_{t-1}|X_t,X_0) \end{aligned} \]

The full generative (reverse) process can then be written as:

\[p_{\theta}(X_{0:T}) = p(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}|X_t) \]

In other words, the generative model \(p_\theta(X_{t-1}|X_t)\) boils down to "use a neural network \(\epsilon_\theta\) to predict \(\epsilon_t\)", then substitute the prediction into the mean formula to obtain the complete reverse-process distribution.
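As a sketch of what this means in code (reusing `alphas` and `alpha_bars` from the snippet above; `eps_model` is a placeholder for the trained network \(\epsilon_\theta\)), one reverse step computes the mean from the predicted noise and adds Gaussian noise with the fixed variance \(\tilde{\beta}_t\):

```python
def p_sample_step(xt: torch.Tensor, t: int, eps_model) -> torch.Tensor:
    """One reverse step X_t -> X_{t-1}; t is 0-based here, so t == 0 is the paper's t = 1."""
    eps_pred = eps_model(xt, t)                            # network estimate of epsilon_t
    alpha, abar = alphas[t], alpha_bars[t]
    mean = (xt - (1 - alpha) / (1 - abar).sqrt() * eps_pred) / alpha.sqrt()
    if t == 0:
        return mean                                        # final step: no noise is added
    abar_prev = alpha_bars[t - 1]
    var = (1 - abar_prev) / (1 - abar) * (1 - alpha)       # tilde{beta}_t from the formula above
    return mean + var.sqrt() * torch.randn_like(xt)
```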

IV. The Loss Function

From the derivation above, the goal of the reverse process is to find a \(\theta\) that maximizes \(p_{\theta}(X_0)\). As is standard for probabilistic models, we estimate \(\theta\) by maximum likelihood:

\[\arg\max_{\theta}\log p_{\theta}(X_0) \]

We cannot touch the ceiling (\(\log p_\theta(X_0)\)) directly, so we find the table that sits closest to it; raising the table pushes us toward the ceiling. That table is the variational lower bound (Evidence Lower Bound, ELBO): maximizing it (equivalently, minimizing \(L_{VLB}\) below) indirectly maximizes the log-likelihood. The core of the derivation is constructing this bound.

\[\begin{aligned} \log p_{\theta}(X_0)&=\int q(X_{1:T}|X_{0})\log p_{\theta}(X_{0})\,dX_{1:T}\\ &=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{p_{\theta}(X_{1:T}|X_{0})}\,dX_{1:T}\\ &=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})\,q(X_{1:T}|X_{0})}{p_{\theta}(X_{1:T}|X_{0})\,q(X_{1:T}|X_{0})}\,dX_{1:T}\\ &=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T}+\int q(X_{1:T}|X_{0})\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{1:T}|X_{0})}\,dX_{1:T}\\ &=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T}+D_{KL}\left(q(X_{1:T}|X_{0}) \parallel p_{\theta}(X_{1:T}|X_{0})\right)\\ &\geqslant \int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T} \qquad \text{(since the KL divergence is non-negative)}\\ &= E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\right] \end{aligned} \]

(The first step holds because \(\log p_\theta(X_0)\) does not depend on \(X_{1:T}\) and \(q(X_{1:T}|X_0)\) integrates to 1.)

Negating both sides:

\[\begin{aligned} -\log p_{\theta}(X_0) &\leqslant - E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\right] = E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right] \end{aligned} \]

This yields:

\[\begin{aligned} L_{VLB}= E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right] &\geq -\log p_{\theta}(X_0) \\ \end{aligned} \]

Minimizing \(-\log p_{\theta}(X_0)\) thus becomes minimizing its upper bound \(L_{VLB}\). We now decompose \(L_{VLB}\) further:

\[\begin{aligned} L_{VLB}&= E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right] \\ &= E_{q(X_{0:T})}\left[\log \frac{\prod_{t=1}^T q(X_{t}\mid X_{t-1})}{p_{\theta}(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}\mid X_t)} \right]\\ &= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=1}^T \log \frac{q(X_{t}\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} \right]\\ &= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log \frac{q(X_{t}\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} +\log \frac{q(X_{1}\mid X_{0})}{p_{\theta}(X_{0}\mid X_1)} \right]\\ \end{aligned} \]

\[\begin{aligned} \text{By the Markov property:}\quad q\left(X_t \mid X_{t-1}\right)&=q\left(X_t \mid X_{t-1}, X_0\right)\\ & =\frac{q\left(X_t, X_{t-1}, X_0\right)}{q\left(X_{t-1}, X_0\right)} \\ & =\frac{q\left(X_{t-1} \mid X_t, X_0\right) q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)} \\ \end{aligned} \]

\[\begin{aligned} L_{VLB} & =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \left(\frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)} \cdot \frac{q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)}\right)+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\ & =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\sum_{t=2}^T \log \frac{q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)}+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\ & =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log \frac{q\left(X_T \mid X_0\right)}{q\left(X_1 \mid X_0\right)}+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\ & =E_{\mathrm{q}}\left[\log \frac{1}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log \frac{q\left(X_T \mid X_0\right)}{q\left(X_1 \mid X_0\right)} \cdot \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\ & =E_{\mathrm{q}}\left[\log \frac{1}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log q\left(X_T \mid X_0\right) +\log\frac{1}{p_\theta\left(X_0 \mid X_1\right)}\right] \\ & =E_{\mathrm{q}}\left[\log \frac{q\left(X_T \mid X_0\right)}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}-\log p_\theta\left(X_0 \mid X_1\right)\right] \\ & =E_{\mathrm{q}}\left[D_{KL}\left(q\left(X_T \mid X_0\right) \parallel p_\theta\left(X_T\right)\right)+\sum_{t=2}^T D_{KL}\left(q\left(X_{t-1} \mid X_t, X_0\right) \parallel p_\theta\left(X_{t-1} \mid X_t\right)\right)-\log p_\theta\left(X_0 \mid X_1\right)\right] \end{aligned} \]

\(L_{VLB} = L_T + L_{T-1} + \ldots + L_0\), where

\[\begin{aligned} L_T &= D_{KL}\left(q\left(X_T|X_0\right) \parallel p_\theta\left(X_T\right)\right)\\ L_{t-1} &= D_{KL}\left(q\left(X_{t-1}|X_t, X_0\right) \parallel p_\theta\left(X_{t-1}|X_t\right)\right), \quad 2 \leq t \leq T\\ L_0 &= -\log p_\theta\left(X_0|X_1\right) \end{aligned} \]

Next we examine \(L_T\), the intermediate terms \(L_{t-1}\), and \(L_0\) in turn:

  • \(L_T\) needs no optimization: \(q\left(X_T|X_0\right)\) is the known forward process, and \(p_\theta\left(X_T\right)\) is the fixed distribution of pure Gaussian noise. \(L_T\) therefore contains no trainable parameters and can be treated as a constant.

  • \(L_0\) also needs no optimization: DDPM fixes \(p_\theta\left(X_0|X_1\right)\) as an independent, discrete decoder derived from a Gaussian.

For the intermediate terms \(L_{t-1}\):

  • \(q\left(X_{t-1}|X_t,X_0\right)=\mathcal{N}\left(X_{t-1};\tilde{\mu}_t\left(X_t,X_0\right),\widetilde{\beta}_tI\right)\) can be computed in closed form
  • \(p_{\theta}\left(X_{t-1}|X_t\right)=\mathcal{N}\left(X_{t-1};\mu_{\theta}\left(X_t,t\right),\Sigma_{\theta}\left(X_t,t\right)\right)\) is the distribution the network is trained to fit

The KL divergence between two univariate Gaussians \(P=\mathcal{N}(\mu_1,\sigma_1^2)\) and \(Q=\mathcal{N}(\mu_2,\sigma_2^2)\) is:

\[D_{KL}(P\parallel Q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2\sigma_2^2}-\frac{1}{2} \]
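As a quick numerical sanity check of this closed form (not part of the original derivation), the sketch below compares it against a Monte-Carlo estimate of the same KL:

```python
import torch

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), per the closed form above."""
    return (torch.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

mu1, s1 = torch.tensor(0.5), torch.tensor(1.2)
mu2, s2 = torch.tensor(-0.3), torch.tensor(0.8)
x = mu1 + s1 * torch.randn(1_000_000)                      # samples from P
log_ratio = (torch.distributions.Normal(mu1, s1).log_prob(x)
             - torch.distributions.Normal(mu2, s2).log_prob(x))
print(gaussian_kl(mu1, s1, mu2, s2).item(), log_ratio.mean().item())  # should match closely
```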

The variances of \(q\left(X_{t-1}|X_t,X_0\right)\) and \(p_{\theta}\left(X_{t-1}|X_t\right)\) are both fixed constants, so the only term that needs optimizing is the squared distance between the two means, \((\mu_1-\mu_2)^2\); that is, we optimize:

\[\begin{aligned} L_{t-1} & \propto E_{q}\left[\left\|\tilde{\mu}_t\left(X_t,X_{0}\right)-\mu_{\theta}\left(X_t,t\right)\right\|^{2}\right] \\ & =E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)-\mu_{\theta}\left(X_t,t\right)\right\|^{2}\right] \end{aligned} \]

We can see that the target for \(\mu_{\theta}\left(X_t,t\right)\) is to get as close as possible to \(\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)\). Since \(X_t\) is the input to \(\mu_\theta\) and is known at time \(t\), the only unknown is \(\epsilon_t\). We can therefore define \(\mu_\theta(X_t,t)\) as:

\[\mu_\theta\left(X_t,t\right)=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta\left(X_t,t\right)\right) \]

This gives:

\[\begin{aligned} L_{t-1} & =E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t})-\frac{1}{\sqrt{\alpha_{t}}}(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{\theta}(X_t,t))\right\|^{2}\right] \\ & =E_{X_0,\epsilon}\left[\frac{\beta_t^{2}}{\alpha_t(1-\overline{\alpha}_t)}\left\|\epsilon_t-\epsilon_\theta(X_t,t)\right\|^2\right] \\ & \propto E_{X_{0},\epsilon}\left[\left\|\epsilon_{t}-\epsilon_{\theta}(X_t,t)\right\|^{2}\right] \end{aligned} \]

Substituting \(X_t=\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t\) into the expression above:

\[L_{t-1}=E_{X_0,\epsilon}\left[\left\|\epsilon_t-\epsilon_\theta(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t,t)\right\|^2\right] \]

Here \(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t\) is just the input data with Gaussian noise added, and \(\epsilon_\theta(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t,t)\) is a noise-prediction network that takes \(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t\) and \(t\) as input and outputs the predicted noise. So what the DDPM network actually does is estimate the noise that was added during the forward process.

In summary, only the intermediate terms \(L_{t-1}\) need to be optimized. Despite the lengthy derivation, the DDPM loss function is simply the \(L_{t-1}\) above, i.e. an L2 loss between the true and predicted noise.
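Concretely, the whole objective collapses to a few lines of code. A minimal sketch, reusing `T` and `alpha_bars` from the earlier schedule snippet, with `eps_model` again standing in for \(\epsilon_\theta\):

```python
def ddpm_loss(x0: torch.Tensor, eps_model) -> torch.Tensor:
    """Simplified DDPM loss: MSE between the true noise and the network's prediction."""
    t = torch.randint(0, T, (x0.shape[0],))                 # uniform random timestep per sample
    eps = torch.randn_like(x0)                              # the noise actually added
    abar = alpha_bars[t].view(-1, 1, 1, 1)                  # broadcast over image dimensions
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps         # closed-form forward sample X_t
    return ((eps - eps_model(xt, t)) ** 2).mean()           # ||eps - eps_theta(X_t, t)||^2
```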

V. Training and Inference

1. Training Procedure

  1. Input data: sample a clean data point $ x_0 $ from the dataset.
  2. Forward diffusion process
    • Gradually add Gaussian noise to $ x_0 $, producing a sequence of noisy samples $ x_1, x_2, \dots, x_T $.
    • Each diffusion step is given by:

      \[x_t = \sqrt{1 - \beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I) \]

      where $ \beta_t $ is the noise-schedule parameter.
  3. Noise prediction
    • For each timestep $ t $, the model $ \epsilon_\theta(x_t, t) $ predicts the noise $ \epsilon_t $ that was added to $ x_t $.
  4. Loss computation
    • The loss is the mean squared error (MSE) between the predicted noise $ \epsilon_\theta(x_t, t) $ and the actual noise $ \epsilon_t $:

      \[\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right] \]

  5. Backpropagation
    • Update the model parameters $ \theta $ by gradient descent to minimize the loss (a training-step sketch follows this list).
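Putting the five steps together, here is a toy end-to-end training sketch. `TinyEpsModel` is a deliberately trivial stand-in for the UNet used in real DDPMs, and the random batches replace a real dataset; `ddpm_loss` is the function sketched at the end of section IV:

```python
import torch
import torch.nn as nn

class TinyEpsModel(nn.Module):
    """Toy stand-in for the UNet noise predictor used in real DDPMs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, xt, t):            # t is unused in this toy; real models embed it
        return self.net(xt)

model = TinyEpsModel()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

for _ in range(100):                     # toy loop over random data
    x0 = torch.randn(8, 3, 32, 32)       # step 1: a batch of "clean" samples
    loss = ddpm_loss(x0, model)          # steps 2-4: add noise, predict it, take MSE
    opt.zero_grad()
    loss.backward()                      # step 5: backpropagate and update theta
    opt.step()
```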

2. Inference (Generation) Procedure

  1. Initialization
    • Sample random noise $ x_T \sim \mathcal{N}(0, I) $ from the standard normal distribution.
  2. Reverse denoising process
    • Starting from $ t = T $, denoise step by step to produce $ x_{T-1}, x_{T-2}, \dots, x_0 $ (a full sampling sketch follows this list).
    • Each denoising step is:

      \[x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sqrt{\beta_t} \cdot z, \quad z \sim \mathcal{N}(0, I) \]

      where:
      • $ \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s) $ is the cumulative noise-schedule parameter.
      • $ \epsilon_\theta(x_t, t) $ is the noise predicted by the model.
      • $ z $ is fresh noise that keeps sampling stochastic; at the final step ($ t = 1 $) it is set to $ z = 0 $.
  3. Output
    • When $ t = 0 $ is reached, $ x_0 $ is the generated sample.
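A matching sampling sketch, reusing `p_sample_step` from section III and the `model` from the training sketch. Note that `p_sample_step` uses the $ \tilde{\beta}_t $ variance derived in section III, whereas the formula in the list above uses $ \sigma_t^2 = \beta_t $; both are standard choices in the DDPM paper:

```python
@torch.no_grad()
def sample(eps_model, shape=(4, 3, 32, 32)) -> torch.Tensor:
    """Full reverse pass: start from pure noise x_T and denoise down to x_0."""
    x = torch.randn(shape)                   # step 1: x_T ~ N(0, I)
    for t in reversed(range(T)):             # step 2: t = T ... 1 (0-based indices)
        x = p_sample_step(x, t, eps_model)   # one denoising step (see earlier sketch)
    return x                                 # step 3: the generated x_0

x_gen = sample(model)
```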

3. Training vs. Inference

| Aspect | Training | Inference |
| --- | --- | --- |
| Input | clean data $ x_0 $ | random noise $ x_T \sim \mathcal{N}(0, I) $ |
| Process | forward diffusion (add noise) + noise prediction + loss computation | reverse denoising (step-by-step generation) |
| Goal | minimize the gap between predicted and actual noise | generate high-quality data from noise |
| Timesteps | from $ t = 1 $ to $ t = T $ | from $ t = T $ down to $ t = 0 $ |
| Model's role | predict the noise $ \epsilon_t $ added at each step | predict the per-step noise $ \epsilon_t $, used for denoising |
| Output | updated model parameters $ \theta $ | the generated sample $ x_0 $ |
