Denoising Diffusion Probabilistic Models (DDPM): Mathematical Derivation (1)
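III. Learning Objective
First, recall the formulas for the diffusion process and the generative process derived earlier:
- Diffusion process: \(q(x_t|x_0) = \mathcal{N}\left(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I\right)\)
- Generative process: \(q\left(X_{t-1}|X_t,X_0\right) = N\left(X_{t-1};\frac{1}{\sqrt{\alpha_t}}(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_t),\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right)\)
In the diffusion process every quantity is a known parameter, so no neural network has to be trained for it.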
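Because the diffusion process has no trainable parameters, a noisy sample at any step can be drawn in one shot. The following is a minimal NumPy sketch of that idea. The linear \(\beta\) schedule, the value \(T=1000\), and the helper names `make_schedule` and `q_sample` are assumptions made for illustration, not something fixed by the derivation.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and the quantities derived from it."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t for t = 1..T
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-based."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = np.ones(100_000)  # placeholder "data": every entry equals 1
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# Empirical mean and variance should be close to sqrt(alpha_bar_t) and 1 - alpha_bar_t.
print(x_t.mean(), np.sqrt(alpha_bars[499]))
print(x_t.var(), 1.0 - alpha_bars[499])
```

With all entries of `x0` equal to 1, the empirical mean and variance of `x_t` estimate the mean \(\sqrt{\bar{\alpha}_t}\) and variance \(1-\bar{\alpha}_t\) of the forward distribution.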
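In the generative process, the variance \(\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)\) is also a known parameter, but the mean contains \(\epsilon_t\), the Gaussian noise added during the forward process. Generation therefore uses a neural network to predict this noise and then removes it from \(X_t\). The noise \(\epsilon_t\) is the only unknown, so it is the only quantity that the network \(\epsilon_\theta\), with parameters \(\theta\), has to learn.
Therefore: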
\[\begin{aligned}
p_{\theta}(X_{t-1}|X_t) &=N\left(X_{t-1};\frac{1}{\sqrt{\alpha_t}}(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(X_t,t)),\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right) \\
&\approx q(X_{t-1}|X_t,X_0) \\
\end{aligned}
\]
The generative process can then be written as:
\[p_{\theta}(X_{0:T}) = p(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}|X_t)
\]
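So the generative model \(p_\theta(X_{t-1}\mid X_t)\) essentially amounts to using a neural network \(\epsilon_\theta\) to predict \(\epsilon_t\), then substituting the prediction into the mean formula to obtain the full reverse-process distribution.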
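As a sanity check on the mean formula, the sketch below compares the noise form of the posterior mean used above with its \(X_0\) form. The \(X_0\) form, \(\frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}X_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}X_t\), is the standard expression from the DDPM derivation and is not restated in this section. The linear schedule and all function names are assumptions for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_from_eps(x_t, eps, t):
    """Mean of q(x_{t-1} | x_t, x_0) in terms of the noise eps (t is 1-based)."""
    a, a_bar = alphas[t - 1], alpha_bars[t - 1]
    return (x_t - (1.0 - a) / np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a)

def posterior_mean_from_x0(x_t, x0, t):
    """The same mean in terms of x_0 (standard form, requires t >= 2)."""
    a, a_bar, a_bar_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = np.sqrt(a_bar_prev) * (1.0 - a) / (1.0 - a_bar)
    coef_xt = np.sqrt(a) * (1.0 - a_bar_prev) / (1.0 - a_bar)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_variance(t):
    """Known variance of q(x_{t-1} | x_t, x_0) (requires t >= 2)."""
    return (1.0 - alpha_bars[t - 2]) / (1.0 - alpha_bars[t - 1]) * (1.0 - alphas[t - 1])

rng = np.random.default_rng(0)
t = 500
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

m_eps = posterior_mean_from_eps(x_t, eps, t)
m_x0 = posterior_mean_from_x0(x_t, x0, t)
print(np.allclose(m_eps, m_x0), posterior_variance(t))
```

When `eps` is the true noise, the two forms agree. At generation time \(X_0\) is unavailable, which is why a predicted noise is plugged into the first form.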
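IV. Loss Function
As the derivation above shows, the generative process seeks a \(\theta\) that maximizes \(p_{\theta}(X_0)\). We can estimate \(\theta\) with maximum likelihood estimation, the usual tool for probabilistic models:
\[\operatorname*{arg\,max}_{\theta}\log p_{\theta}(X_0)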
\]
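We cannot reach the ceiling (\(\log p_\theta(X_0)\)) directly, so we look for the table closest to the ceiling: a lower bound on \(\log p_\theta(X_0)\). Raising the table as high as possible brings us indirectly close to the ceiling. Raising it means maximizing the lower bound, or equivalently minimizing its negative \(L_{VLB}\), which is an upper bound on \(-\log p_\theta(X_0)\). This table is the variational lower bound (VLB), also called the evidence lower bound (ELBO), and the core of the derivation is constructing it.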
\[\begin{aligned}
\log p_{\theta}(X_0)&=\int q(X_{1:T}|X_{0})\log p_{\theta}(X_{0})\,dX_{1:T}\\
&=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{p_{\theta}(X_{1:T}|X_{0})}\,dX_{1:T}\\
&=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})\,q(X_{1:T}|X_{0})}{p_{\theta}(X_{1:T}|X_{0})\,q(X_{1:T}|X_{0})}\,dX_{1:T}\\
&=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T}\\
&\quad +\int q(X_{1:T}|X_{0})\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{1:T}|X_{0})}\,dX_{1:T} \\
&=\int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T}\\
&\quad +D_{KL}(q(X_{1:T}|X_{0}) \parallel p_{\theta}(X_{1:T}|X_{0})) \\
&\quad\text{(because the KL divergence is non-negative)} \\
&\geqslant \int q(X_{1:T}|X_{0})\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\,dX_{1:T} \\
&= E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\right]
\end{aligned}
\]
Negating both sides:
\[\begin{aligned}
-\log p_{\theta}(X_0) &\leqslant - E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_{0})}\right] \\
&= E_{X_{1:T}\sim q(X_{1:T}|X_{0})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right]
\end{aligned}
\]
Taking the expectation over \(X_0 \sim q(X_0)\) as well, we obtain:
\[\begin{aligned}
L_{VLB}= E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right] &\geq -E_{q(X_0)}\left[\log p_{\theta}(X_0)\right] \\
\end{aligned}
\]
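The decomposition behind this bound, log-likelihood = lower bound + KL divergence, can be checked numerically on a tiny discrete model. The sketch below is not part of DDPM itself. The three-state latent variable and the probability values are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy model: one fixed observation x and a latent z with 3 states.
p_joint = np.array([0.10, 0.25, 0.05])  # p_theta(x, z) for z = 0, 1, 2
q = np.array([0.5, 0.3, 0.2])           # an arbitrary q(z | x) with full support

log_px = np.log(p_joint.sum())          # log p_theta(x), the "ceiling"
p_post = p_joint / p_joint.sum()        # exact posterior p_theta(z | x)

elbo = np.sum(q * np.log(p_joint / q))  # E_q[log p_theta(x, z) / q(z | x)]
kl = np.sum(q * np.log(q / p_post))     # KL(q(z | x) || p_theta(z | x))

print(log_px, elbo + kl)                # the two numbers agree
print(elbo <= log_px)                   # the bound holds because kl >= 0
```

The gap between the bound and the true log-likelihood is exactly the KL term, so the bound is tight only when `q` equals the exact posterior.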
Minimizing \(-\log p_{\theta}(X_0)\) thus turns into minimizing its upper bound \(L_{VLB}\). We now expand \(L_{VLB}\) further:
\[\begin{aligned}
L_{VLB}&= E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T}|X_{0})}{p_{\theta}(X_{0:T})}\right] \\
&= E_{q(X_{0:T})}\left[\log \frac{\prod_{t=1}^T q(X_{t}\mid X_{t-1})}{p_{\theta}(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}\mid X_t)} \right]\\
&= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=1}^T \log \frac{q(X_{t}\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} \right]\\
&= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log \frac{q(X_{t}\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} +\log \frac{q(X_{1}\mid X_{0})}{p_{\theta}(X_{0}\mid X_1)} \right]\\
\end{aligned}
\]
\[\begin{aligned}
\text{By the Markov property:}\quad q\left(X_t \mid X_{t-1}\right)&=q\left(X_t \mid X_{t-1}, X_0\right)\\
& =\frac{q\left(X_t, X_{t-1}, X_0\right)}{q\left(X_{t-1}, X_0\right)} \\
& =\frac{q\left(X_{t-1} \mid X_t, X_0\right) q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)} \\
\end{aligned}
\]
\[\begin{aligned}
L_{VLB} & =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \left(\frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)} \cdot \frac{q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)}\right)+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\
& =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\sum_{t=2}^T \log \frac{q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)}+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\
& =E_{\mathrm{q}}\left[-\log p_\theta\left(X_T\right)+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log \frac{q\left(X_T \mid X_0\right)}{q\left(X_1 \mid X_0\right)}+\log \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right] \\
& =E_{\mathrm{q}}\left[\log \frac{1}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log \left(\frac{q\left(X_T \mid X_0\right)}{q\left(X_1 \mid X_0\right)} \cdot \frac{q\left(X_1 \mid X_0\right)}{p_\theta\left(X_0 \mid X_1\right)}\right)\right] \\
& =E_{\mathrm{q}}\left[\log \frac{1}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}+\log q\left(X_T \mid X_0\right) +\log\frac{1}{p_\theta\left(X_0 \mid X_1\right)}\right] \\
& =E_{\mathrm{q}}\left[\log \frac{q\left(X_T \mid X_0\right)}{p_\theta\left(X_T\right)}+\sum_{t=2}^T \log \frac{q\left(X_{t-1} \mid X_t, X_0\right)}{p_\theta\left(X_{t-1} \mid X_t\right)}-\log p_\theta\left(X_0 \mid X_1\right)\right] \\
& =E_{\mathrm{q}}\left[D_{KL}\left(q\left(X_T \mid X_0\right) \parallel p_\theta\left(X_T\right)\right)+\sum_{t=2}^T D_{KL}\left(q\left(X_{t-1} \mid X_t, X_0\right) \parallel p_\theta\left(X_{t-1} \mid X_t\right)\right)-\log p_\theta\left(X_0 \mid X_1\right)\right]
\end{aligned}
\]
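The step that collapses the second sum in the derivation above is a telescoping sum: every intermediate term cancels. As a small worked example, for \(T=3\):
\[\begin{aligned}
\sum_{t=2}^{3} \log \frac{q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)} &= \log \frac{q\left(X_2 \mid X_0\right)}{q\left(X_1 \mid X_0\right)}+\log \frac{q\left(X_3 \mid X_0\right)}{q\left(X_2 \mid X_0\right)} \\
&= \log \frac{q\left(X_3 \mid X_0\right)}{q\left(X_1 \mid X_0\right)}
\end{aligned}
\]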
Write \(L_{VLB} = L_T + L_{T-1} + \ldots + L_0\), where
\[\begin{aligned}
L_T &= D_{KL}\left(q\left(X_T|X_0\right) \parallel p_\theta\left(X_T\right)\right)\\
L_t &= D_{KL}\left(q\left(X_t|X_{t+1}, X_0\right) \parallel p_\theta\left(X_t|X_{t+1}\right)\right), \quad 1 \leq t \leq T-1\\
L_0 &= -\log p_\theta\left(X_0|X_1\right)
\end{aligned}
\]
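Next we examine \(L_T\), \(L_t\), and \(L_0\) in turn:
- \(L_T\) does not need to be optimized: \(q\left(X_T|X_0\right)\) is the known forward process and \(p_\theta\left(X_T\right)\) is a known pure Gaussian noise distribution, so \(L_T\) contains no trainable parameters and can be treated as a constant.
- \(L_0\) is not optimized in this derivation either: DDPM models \(p_\theta\left(X_0|X_1\right)\) with a separate discrete decoder derived from a Gaussian distribution.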
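To get a feel for why \(L_T\) can be ignored, the sketch below evaluates the per-dimension KL divergence between \(q(X_T|X_0)\) and a standard normal prior. The linear schedule with \(T=1000\) and the value \(X_0=1\) are assumptions chosen only for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule, T = 1000
alpha_bar_T = np.prod(1.0 - betas)     # alpha_bar at the final step

def kl_to_standard_normal(mu, var):
    """Per-dimension KL( N(mu, var) || N(0, 1) )."""
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))

x0 = 1.0  # an arbitrary data value
kl_T = kl_to_standard_normal(np.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T)
print(alpha_bar_T, kl_T)
```

Under this schedule \(\bar{\alpha}_T\) is tiny, so \(q(X_T|X_0)\) is almost indistinguishable from pure Gaussian noise and the KL term is close to zero regardless of \(\theta\).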
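For the KL terms inside the sum, written \(L_t\) from here on:
- \(q\left(X_{t-1}|X_t,X_0\right)=N\left(X_{t-1};\tilde{\mu}\left(X_t,X_0\right),\widetilde{\beta}_tI\right)\) can be computed in closed form.
- \(p_{\theta}\left(X_{t-1}|X_t\right)=N\left(X_{t-1};\mu_{\theta}\left(X_t,t\right),\Sigma_{\theta}\left(X_t,t\right)\right)\) is the distribution the network is trained to fit.
The KL divergence between two univariate Gaussians \(P=N(\mu_1,\sigma_1^2)\) and \(Q=N(\mu_2,\sigma_2^2)\) is: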
\[D_{KL}(P\parallel Q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2\sigma_2^2}-\frac{1}{2}
\]
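As a quick numerical illustration of this formula: when both standard deviations are held fixed, the KL divergence depends on the means only through \(\frac{(\mu_1-\mu_2)^2}{2\sigma_2^2}\). The numbers below are arbitrary and `gaussian_kl` is a helper name chosen here.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)

sigma1, sigma2 = 0.3, 0.4  # fixed (non-trainable) standard deviations
mu1, mu2 = 1.2, 0.7

kl = gaussian_kl(mu1, sigma1, mu2, sigma2)
kl_same_mean = gaussian_kl(mu1, sigma1, mu1, sigma2)  # part that does not depend on the means
mean_term = (mu1 - mu2)**2 / (2.0 * sigma2**2)

print(kl - kl_same_mean, mean_term)  # the two numbers agree
```

The remaining part, `kl_same_mean`, is a constant once the variances are fixed, so it has no influence on the optimum.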
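Since the variances of \(q\left(X_{t-1}|X_t,X_0\right)\) and \(p_{\theta}\left(X_{t-1}|X_t\right)\) are both constants (DDPM fixes \(\Sigma_\theta\) rather than learning it), the only term left to optimize is the squared L2 distance between the two means, \((\mu_1-\mu_2)^2\). Dropping the constant weight, we optimize: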
\[\begin{aligned}
L_{t} & =E_{q}\left[\left\|\tilde{\mu}\left(X_t,X_{0}\right)-\mu_{\theta}\left(X_t,t\right)\right\|^{2}\right] \\
& =E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)-\mu_{\theta}\left(X_t,t\right)\right\|^{2}\right]
\end{aligned}
\]
We can see that \(\mu_{\theta}\left(X_t,t\right)\) is trained to be as close as possible to \(\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)\). Since \(X_t\) is the input of \(\mu_\theta\) and is known at step \(t\), the only unknown is \(\epsilon_t\). We can therefore define \(\mu_\theta(X_t,t)\) as:
\[\mu_\theta\left(X_t,t\right)=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta\left(X_t,t\right)\right)
\]
This gives:
\[\begin{aligned}
L_{t} & =E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t})-\frac{1}{\sqrt{\alpha_{t}}}(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{\theta}(X_t,t))\right\|^{2}\right] \\
& =E_{X_0,\epsilon}\left[\frac{\beta_t^{2}}{\alpha_t(1-\overline{\alpha}_t)}\left\|\epsilon_t-\epsilon_\theta(X_t,t)\right\|^2\right] \\
& \propto E_{X_{0},\epsilon}\left[\left\|\epsilon_{t}-\epsilon_{\theta}(X_t,t)\right\|^{2}\right]
\end{aligned}
\]
Substituting \(X_t=\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t\) into the expression above:
\[L_t=E_{X_0,\epsilon}\left[\left\|\epsilon_t-\epsilon_\theta(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t,t)\right\|^2\right]
\]
Here \(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t\) is simply the input data with Gaussian noise added, and \(\epsilon_\theta(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\epsilon_t,t)\) is a noise-prediction network that takes this noisy sample and \(t\) as input and outputs the predicted noise. In other words, what the DDPM network actually does is estimate the noise added during the diffusion process.
In summary, only \(L_t\) needs to be optimized. After this lengthy derivation, the DDPM loss is just the \(L_t\) above, which is an L2 loss.
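The proportionality used in the derivation above, between the squared distance of the means and the squared noise error, can be verified numerically. In this sketch the linear schedule, the step \(t=300\), and the random vector `eps_hat` standing in for a network output are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 300  # 1-based time step
beta_t, alpha_t, a_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]

x_t = rng.standard_normal(8)
eps = rng.standard_normal(8)      # true noise
eps_hat = rng.standard_normal(8)  # stand-in for eps_theta(x_t, t)

def mean_from_eps(x_t, e):
    """Reverse-process mean written in terms of a noise vector e."""
    return (x_t - beta_t / np.sqrt(1.0 - a_bar) * e) / np.sqrt(alpha_t)

lhs = np.sum((mean_from_eps(x_t, eps) - mean_from_eps(x_t, eps_hat)) ** 2)
rhs = beta_t**2 / (alpha_t * (1.0 - a_bar)) * np.sum((eps - eps_hat) ** 2)
print(lhs, rhs)  # the two numbers agree
```

The weight \(\frac{\beta_t^2}{\alpha_t(1-\overline{\alpha}_t)}\) depends only on the schedule, which is why dropping it leaves a plain L2 loss on the noise.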
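V. Training and Inference
1. Training procedure
- Input data: sample a clean data sample $x_0$ from the dataset.
- Forward diffusion: sample a time step $t$ from $\{1, \dots, T\}$ and Gaussian noise $\epsilon_t \sim \mathcal{N}(0, I)$, then compute $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon_t$.
- Noise prediction:
  - For each time step $t$, the model $N_\theta(x_t, t)$ (the network written \(\epsilon_\theta\) above) predicts the noise $\epsilon_t$ added to $x_t$.
- Loss computation:
  - The loss is the mean squared error (MSE) between the predicted noise $N_\theta(x_t, t)$ and the actual noise $\epsilon_t$: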
\[\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon_t} \left[ \left\| \epsilon_t - N_\theta(x_t, t) \right\|^2 \right]
\]
- Backpropagation and parameter update:
  - Update the model parameters $\theta$ by gradient descent to minimize the loss.
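The training steps above can be sketched end to end in NumPy. This is only an illustration under strong assumptions: `toy_model` is a hypothetical stand-in for $N_\theta$ with a single scalar weight that ignores $t$, the "dataset" is random numbers, the schedule is linear, and the gradient is written by hand instead of using backpropagation in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def make_batch(x0, rng):
    """Sample t and eps, then build x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    n = x0.shape[0]
    t = rng.integers(1, T + 1, size=n)  # t drawn uniformly from {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1][:, None]  # one alpha_bar per sample
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, t, eps

def toy_model(x_t, t, w):
    """Hypothetical stand-in for N_theta: one scalar weight, ignores t."""
    return w * x_t

def loss_and_grad(w, x_t, t, eps):
    """MSE between predicted and true noise, and its derivative w.r.t. w."""
    err = toy_model(x_t, t, w) - eps
    return np.mean(err**2), 2.0 * np.mean(err * x_t)

x0 = rng.standard_normal((256, 2))  # placeholder "dataset"
x_t, t, eps = make_batch(x0, rng)

w, lr = 0.0, 0.1
loss_before, grad = loss_and_grad(w, x_t, t, eps)
w = w - lr * grad                   # one gradient-descent update
loss_after, _ = loss_and_grad(w, x_t, t, eps)
print(loss_before, loss_after)
```

A real implementation would replace `toy_model` with a time-conditioned neural network and repeat the update over many freshly sampled batches.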
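2. Inference (generation) procedure
- Initialization:
  - Sample random noise $x_T \sim \mathcal{N}(0, I)$ from the standard normal distribution.
- Reverse denoising:
  - Starting from $t = T$, denoise step by step to generate $x_{T-1}, x_{T-2}, \dots, x_0$.
  - Each denoising step is: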
\[x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} N_\theta(x_t, t) \right) + \sqrt{\beta_t} \cdot z, \quad z \sim \mathcal{N}(0, I)
\]
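where:
  - $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise-schedule parameter.
  - $N_\theta(x_t, t)$ is the noise predicted by the model.
  - $z$ is freshly sampled noise that keeps the process stochastic; DDPM sets $z = 0$ in the final step $t = 1$.
- Generated data:
  - After the final step ($t = 1$) we obtain the generated sample $x_0$.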
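The generation loop above can be sketched as follows. To keep the example checkable without a trained network, `oracle_model` is a hypothetical stand-in for $N_\theta$: it returns the exact noise for a "dataset" that consists of the single point `c`, so the sampler should land on `c`. The linear schedule and all names are assumptions for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
c = np.array([2.0, -1.0])           # the only "data point" of the toy dataset

def oracle_model(x_t, t):
    """Hypothetical stand-in for N_theta: exact noise when all data equals c."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * c) / np.sqrt(1.0 - a_bar)

def ddpm_sample(model, shape, betas, rng):
    """Reverse denoising with step noise sqrt(beta_t) * z, as in the update rule above."""
    a_bars = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):  # t = T, ..., 1
        beta_t, a_bar = betas[t - 1], a_bars[t - 1]
        eps_hat = model(x, t)
        mean = (x - beta_t / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(1.0 - beta_t)
        # No noise is added in the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(beta_t) * z
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample(oracle_model, (4, 2), betas, rng)
print(sample)  # every row is (numerically) equal to c
```

With a trained network in place of `oracle_model`, the same loop produces samples from the learned data distribution instead of a single point.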
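3. Training vs. inference
| Step | Training | Inference |
| --- | --- | --- |
| Input | Clean data $x_0$ | Random noise $x_T \sim \mathcal{N}(0, I)$ |
| Process | Forward diffusion (adding noise) + noise prediction + loss computation | Reverse denoising (step-by-step generation) |
| Goal | Minimize the difference between predicted and actual noise | Generate high-quality data from noise |
| Time steps | $t$ sampled from $\{1, \dots, T\}$ | From $t = T$ down to $t = 1$ |
| Model role | Predict the noise $\epsilon_t$ added at each step | Predict the noise $\epsilon_t$ at each step, used for denoising |
| Output | Updated model parameters $\theta$ | Generated data $x_0$ |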