
Step-by-Step Diffusion: An Elementary Tutorial
Preetum Nakkiran$^{1}$, Arwen Bradley$^{1}$, Hattie Zhou$^{1,2}$, Madhu Advani$^{1}$
$^{1}$Apple, $^{2}$Mila, Université de Montréal

We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.

Contents

1 Fundamentals of Diffusion
1.1 Gaussian Diffusion
1.2 Diffusions in the Abstract
1.3 Discretization
2 Stochastic Sampling: DDPM
2.1 Correctness of DDPM
2.2 Algorithms
2.3 Variance Reduction: Predicting $x_0$
2.4 Diffusions as SDEs [Optional]
3 Deterministic Sampling: DDIM
3.1 Case 1: Single Point
3.2 Velocity Fields and Gases
3.3 Case 2: Two Points
3.4 Case 3: Arbitrary Distributions
3.5 The Probability Flow ODE [Optional]
3.6 Discussion: DDPM vs DDIM
3.7 Remarks on Generalization
4 Flow Matching
4.1 Flows
4.2 Pointwise Flows
4.3 Marginal Flows
4.4 A Simple Choice of Pointwise Flow
4.5 Flow Matching
4.6 DDIM as Flow Matching [Optional]
4.7 Additional Remarks and References [Optional]
5 Diffusion in Practice
A Additional Resources
B Omitted Derivations

Preface

There are many existing resources for learning diffusion models. Why did we write another? Our goal was to teach diffusion as simply as possible, with minimal mathematical and machine learning prerequisites, but in enough detail to reason about its correctness. Unlike most tutorials on this subject, we take neither a Variational Auto Encoder (VAE) nor a Stochastic Differential Equation (SDE) approach. In fact, for the core ideas we will not need any SDEs, Evidence Lower Bounds (ELBOs), Langevin dynamics, or even the notion of a score. The reader need only be familiar with basic probability, calculus, linear algebra, and multivariate Gaussians. The intended audience for this tutorial is technical readers at the level of at least advanced undergraduate or graduate students, who are learning diffusion for the first time and want a mathematical understanding of the subject.
This tutorial has five parts, each relatively self-contained, but covering closely related topics. Section 1 presents the fundamentals of diffusion: the problem we are trying to solve and an overview of the basic approach. Sections 2 and 3 show how to construct a stochastic and deterministic diffusion sampler, respectively, and give intuitive derivations for why these samplers correctly reverse the forward diffusion process. Section 4 covers the closely-related topic of Flow Matching, which can be thought of as a generalization of diffusion that offers additional flexibility (including what are called rectified flows or linear flows). Finally, in Section 5 we return to diffusion and connect this tutorial to the broader literature while highlighting some of the design choices that matter most in practice, including samplers, noise schedules, and parametrizations.

Acknowledgements

We are grateful for helpful feedback and suggestions from many people, in particular: Josh Susskind, Eugene Ndiaye, Dan Busbridge, Sam Power, De Wang, Russ Webb, Sitan Chen, Vimal Thilak, Etai Littwin, Chenyang Yuan, Alex Schwing, Miguel Angel Bautista Martin, and Dilip Krishnan.

1 Fundamentals of Diffusion

The goal of generative modeling is: given i.i.d. samples from some unknown distribution $p^*(x)$, construct a sampler for (approximately) the same distribution. For example, given a training set of dog images from some underlying distribution $p_{\text{dog}}$, we want a method of producing new images of dogs from this distribution.
One way to solve this problem, at a high level, is to learn a transformation from some easy-to-sample distribution (such as Gaussian noise) to our target distribution $p^*$. Diffusion models offer a general framework for learning such transformations. The clever trick of diffusion is to reduce the problem of sampling from the distribution $p^*(x)$ to a sequence of easier sampling problems.
This idea is best explained via the following Gaussian diffusion example. We'll sketch the main ideas now, and in later sections we will use this setup to derive what are commonly known as the DDPM and DDIM samplers$^{1}$, and reason about their correctness.

1.1 Gaussian Diffusion

For Gaussian diffusion, let $x_0$ be a random variable in $\mathbb{R}^d$ distributed according to the target distribution $p^*$ (e.g., images of dogs). Then construct a sequence of random variables $x_1, x_2, \ldots, x_T$ by successively adding independent Gaussian noise with some small scale $\sigma$:
\begin{equation*}
x_{t+1} := x_{t} + \eta_{t}, \quad \eta_{t} \sim \mathcal{N}\left(0, \sigma^{2}\right). \tag{1}
\end{equation*}
This is called the forward process$^{2}$, which transforms the data distribution into a noise distribution. Equation (1) defines a joint distribution over all $(x_0, x_1, \ldots, x_T)$, and we let $\{p_t\}_{t \in [T]}$ denote the marginal distributions of each $x_t$. Notice that at large step count $T$, the distribution $p_T$ is nearly Gaussian$^{3}$, so we can approximately sample from $p_T$ by just sampling a Gaussian.
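To make this concrete, here is a minimal NumPy sketch of the forward process in Equation (1); the array `x0` of target samples is taken as given, and all names here are illustrative, not part of the original text:

```python
import numpy as np

def forward_process(x0: np.ndarray, T: int, sigma: float, rng=np.random.default_rng()):
    """Run the Gaussian forward process of Equation (1).

    x0: array of shape (n, d) holding samples from the target distribution p*.
    Returns [x_0, x_1, ..., x_T], where x_t equals x_0 plus t independent
    N(0, sigma^2 I) noise vectors, so x_t is marginally distributed as p_t.
    """
    xs = [x0]
    for t in range(T):
        noise = sigma * rng.standard_normal(x0.shape)  # eta_t ~ N(0, sigma^2 I)
        xs.append(xs[-1] + noise)
    return xs

# For large T, the accumulated noise has variance T * sigma^2, so p_T is
# close to a (wide) Gaussian and is easy to sample from directly.
```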
Figure 1: Probability distributions defined by the diffusion forward process on a one-dimensional target distribution $p_0$.
Now, suppose we can solve the following subproblem:
"Given a sample marginally distributed as p t p t p_(t)p_{t}pt, produce a sample marginally distributed as p t 1 p t 1 p_(t-1)p_{t-1}pt1 ".
"给定一个边际分布为 p t p t p_(t)p_{t}pt 的样本,生成一个边际分布为 p t 1 p t 1 p_(t-1)p_{t-1}pt1 的样本。"
We will call a method that does this a reverse sampler$^{4}$, since it tells us how to sample from $p_{t-1}$ assuming we can already sample from $p_t$. If we had a reverse sampler, we could sample from our target $p_0$ by simply starting with a Gaussian sample from $p_T$, and iteratively applying the reverse sampling procedure to get samples from $p_{T-1}, p_{T-2}, \ldots$ and finally $p_0 = p^*$.
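With such a reverse sampler in hand, the full generation loop is just the iteration described above. A minimal sketch follows; here `reverse_sampler(z, t)` is a hypothetical function implementing the step from $p_t$ to $p_{t-1}$, and the Gaussian initialization is only an approximation to $p_T$:

```python
import numpy as np

def generate(reverse_sampler, T: int, d: int, sigma: float, rng=np.random.default_rng()):
    """Draw one (approximate) sample from p_0 = p* by reversing the diffusion.

    reverse_sampler(z, t) is assumed to return a sample whose marginal
    distribution is p_{t-1} whenever z is marginally distributed as p_t.
    """
    # Approximate sample from p_T: after T noising steps the marginal is
    # roughly N(0, T * sigma^2 I), assuming the data is small compared to
    # the accumulated noise, so we simply sample a wide Gaussian.
    x = np.sqrt(T) * sigma * rng.standard_normal(d)
    for t in range(T, 0, -1):       # apply reverse steps t = T, T-1, ..., 1
        x = reverse_sampler(x, t)   # x is now (marginally) a sample of p_{t-1}
    return x                        # approximately a sample from p_0 = p*
```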
The key insight of diffusion is that learning to reverse each intermediate step can be easier than learning to sample from the target distribution in one step$^{5}$. There are many ways to construct reverse samplers, but for concreteness let us first see the standard diffusion sampler, which we will call the DDPM sampler$^{6}$.
$^{5}$ Intuitively this is because the distributions $(p_{t-1}, p_t)$ are already quite close, so the reverse sampler does not need to do much.
The ideal DDPM sampler uses the obvious strategy: at time $t$, given input $z$ (which is promised to be a sample from $p_t$), we output a sample from the conditional distribution
\begin{equation*}
p\left(x_{t-1} \mid x_{t}=z\right) \tag{2}
\end{equation*}
$^{6}$ This is the sampling strategy originally proposed in Sohl-Dickstein et al. [2015].
This is clearly a correct reverse sampler. The problem is, it requires learning a generative model for the conditional distribution $p(x_{t-1} \mid x_t)$ for every $x_t$, which could be complicated. But if the per-step noise $\sigma$ is sufficiently small, then it turns out this conditional distribution becomes simple:
Fact 1 (Diffusion Reverse Process). For small $\sigma$, and the Gaussian diffusion process defined in (1), the conditional distribution $p(x_{t-1} \mid x_t)$ is itself close to Gaussian. That is, for all times $t$ and conditionings $z \in \mathbb{R}^d$, there exists some mean parameter $\mu \in \mathbb{R}^d$ such that
\begin{equation*}
p\left(x_{t-1} \mid x_{t}=z\right) \approx \mathcal{N}\left(x_{t-1} ; \mu, \sigma^{2}\right) \tag{3}
\end{equation*}
This is not an obvious fact; we will derive it in Section 2.1. This fact enables a drastic simplification: instead of having to learn an arbitrary distribution $p(x_{t-1} \mid x_t)$ from scratch, we now know everything about this distribution except its mean, which we denote$^{7}$ $\mu_{t-1}(x_t)$. The fact that we can approximate the posterior distribution as Gaussian when $\sigma$ is sufficiently small is illustrated in Figure 2. This is an important point, so to re-iterate: for a given time $t$ and conditioning value $x_t$, learning the mean of $p(x_{t-1} \mid x_t)$ is sufficient to learn the full conditional distribution $p(x_{t-1} \mid x_t)$.

Figure 2: Illustration of Fact 1. The prior distribution $p(x_{t-1})$, leftmost, defines a joint distribution $(x_{t-1}, x_t)$ where $p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \sigma^2)$. We plot the reverse conditional distributions $p(x_{t-1} \mid x_t)$ for a fixed conditioning $x_t$, and varying noise levels $\sigma$. Notice these distributions become close to Gaussian for small $\sigma$.
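Fact 1 can also be checked empirically with a short Monte Carlo experiment. The sketch below uses an arbitrary bimodal one-dimensional target and constants chosen purely for illustration: it draws many $(x_{t-1}, x_t)$ pairs, keeps those with $x_t$ close to a fixed value $z$, and reports the mean and standard deviation of the retained $x_{t-1}$. As $\sigma$ shrinks, the conditional standard deviation approaches $\sigma$, as Fact 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# A bimodal 1-D target p(x_{t-1}): equal mixture of N(-1, 0.5^2) and N(+1, 0.5^2).
x_prev = np.where(rng.random(n) < 0.5, -1.0, 1.0) + 0.5 * rng.standard_normal(n)

z = 0.8  # fixed conditioning value for x_t

for sigma in [0.5, 0.2, 0.05]:
    x_t = x_prev + sigma * rng.standard_normal(n)  # one forward step, Equation (1)
    cond = x_prev[np.abs(x_t - z) < 0.01]          # samples of x_{t-1} given x_t ~= z
    print(f"sigma={sigma}: conditional mean {cond.mean():+.3f}, "
          f"std {cond.std():.3f} (Fact 1: std ~ sigma for small sigma)")
```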
Learning the mean of $p(x_{t-1} \mid x_t)$ is a much simpler problem than learning the full conditional distribution, because we can solve it by regression. To elaborate, we have a joint distribution $(x_{t-1}, x_t)$ from which we can easily sample, and we would like to estimate $\mathbb{E}[x_{t-1} \mid x_t]$. This can be done by optimizing a standard regression loss$^{8}$:
\begin{align*}
\mu_{t-1}(z) &:= \mathbb{E}\left[x_{t-1} \mid x_{t}=z\right] \tag{4}\\
\Longrightarrow \quad \mu_{t-1} &= \underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}}\; \underset{x_{t}, x_{t-1}}{\mathbb{E}}\left\|f\left(x_{t}\right)-x_{t-1}\right\|_{2}^{2} \tag{5}\\
&= \underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}}\; \underset{x_{t-1}, \eta_{t}}{\mathbb{E}}\left\|f\left(x_{t-1}+\eta_{t}\right)-x_{t-1}\right\|_{2}^{2}, \tag{6}
\end{align*}
where the expectation is taken over samples $x_0$ from our target distribution $p^*$.$^{9}$ This particular regression problem is well-studied in certain settings. For example, when the target $p^*$ is a distribution on images, then the corresponding regression problem (Equation 6) is exactly an image denoising objective, which can be approached with familiar methods (e.g. convolutional neural networks).
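As a concrete, purely illustrative sketch of the regression objective in Equation (6): the snippet below assumes a generic trainable function `model(x, t)` (any network mapping $\mathbb{R}^d$ and a timestep to $\mathbb{R}^d$), simulates $(x_{t-1}, x_t)$ pairs by adding noise to $x_0$ as in footnote 9, and computes the squared-error loss. Training a single network across randomly sampled timesteps, as done here, is a common practical choice rather than something derived above.

```python
import torch

def ddpm_regression_loss(model, x0: torch.Tensor, sigma: float, T: int = 1000) -> torch.Tensor:
    """One Monte Carlo estimate of the regression loss in Equation (6).

    model(x, t): trainable function meant to approximate mu_{t-1}(x_t) = E[x_{t-1} | x_t].
    x0: batch of shape (B, d), samples from the target distribution p*.
    """
    t = torch.randint(1, T + 1, (x0.shape[0],))  # random timestep per sample
    # Simulate the joint pair (x_{t-1}, x_t): x_{t-1} = x_0 + (t-1) noise steps,
    # and x_t = x_{t-1} + one more independent N(0, sigma^2 I) step.
    x_prev = x0 + sigma * torch.sqrt((t - 1).float()).unsqueeze(-1) * torch.randn_like(x0)
    x_t = x_prev + sigma * torch.randn_like(x0)
    pred = model(x_t, t)                         # estimate of E[x_{t-1} | x_t]
    return ((pred - x_prev) ** 2).sum(dim=-1).mean()
```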
Stepping back, we have seen something remarkable: we have reduced the problem of learning to sample from an arbitrary distribution to the standard problem of regression.

1.2 Diffusions in the Abstract

Let us now abstract away the Gaussian setting, to define diffusion-like models in a way that will capture their many instantiations (including deterministic samplers, discrete domains, and flow matching).
Abstractly, here is how to construct a diffusion-like generative model: We start with our target distribution $p^*$, and we pick some base distribution $q(x)$ which is easy to sample from, e.g. a standard Gaussian or i.i.d. bits. We then try to construct a sequence of distributions which interpolate between our target $p^*$ and the base distribution $q$. That is, we construct distributions
\begin{equation*}
p_{0}, p_{1}, p_{2}, \ldots, p_{T} \tag{7}
\end{equation*}
$^{7}$ We denote the mean as a function $\mu_{t-1}: \mathbb{R}^d \rightarrow \mathbb{R}^d$ because the mean of $p(x_{t-1} \mid x_t)$ depends on the time $t$ as well as the conditioning $x_t$, as described in Fact 1.
$^{8}$ Recall the generic fact that for any distribution over $(x, y)$, we have: $\operatorname{argmin}_{f} \mathbb{E}\|f(x)-y\|^{2}=\mathbb{E}[y \mid x]$.
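For completeness, the fact in footnote 8 follows from a short, standard calculation: for any $f$,
\begin{align*}
\mathbb{E}\|f(x)-y\|^{2} &= \mathbb{E}\left\|f(x)-\mathbb{E}[y \mid x]\right\|^{2} + \mathbb{E}\left\|\mathbb{E}[y \mid x]-y\right\|^{2} \\
&\quad + 2\, \mathbb{E}\left[\left\langle f(x)-\mathbb{E}[y \mid x],\; \mathbb{E}[y \mid x]-y \right\rangle\right],
\end{align*}
and the cross term vanishes: conditioning on $x$, the first factor is fixed and the second has conditional mean zero. Only the first term depends on $f$, and it is minimized by taking $f(x) = \mathbb{E}[y \mid x]$.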
$^{9}$ Notice that we simulate samples of $(x_{t-1}, x_t)$ by adding noise to the samples of $x_0$, as defined in Equation 1.

such that $p_0 = p^*$ is our target, $p_T = q$ the base distribution, and adjacent distributions $(p_{t-1}, p_t)$ are marginally "close" in some appropriate sense. Then, we learn a reverse sampler which transforms distributions $p_t$ to $p_{t-1}$. This is the key learning step, which presumably is made easier by the fact that adjacent distributions are "close." Formally, reverse samplers are defined below.
Definition 1 (Reverse Sampler). Given a sequence of marginal distributions $p_t$, a reverse sampler for step $t$ is a potentially stochastic function $F_t$ such that if $x_t \sim p_t$, then the marginal distribution of $F_t(x_t)$ is exactly $p_{t-1}$:
\begin{equation*}
\left\{F_{t}(z): z \sim p_{t}\right\} \equiv p_{t-1} \tag{8}
\end{equation*}
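As a toy illustration of Definition 1 (a sanity-check example, not taken from the original text): suppose $d=1$ and the target is $p_0 = \mathcal{N}(0,1)$, so the forward process (1) gives marginals $p_t = \mathcal{N}(0, 1+t\sigma^2)$. Then the deterministic map
\begin{equation*}
F_{t}(z) := \sqrt{\frac{1+(t-1)\sigma^{2}}{1+t\sigma^{2}}}\, z
\end{equation*}
is a valid reverse sampler: if $z \sim \mathcal{N}(0, 1+t\sigma^2)$, then $F_t(z) \sim \mathcal{N}(0, 1+(t-1)\sigma^2) = p_{t-1}$. Note that Definition 1 only constrains the marginals; $F_t$ need not (and here does not) invert the forward process sample-by-sample.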
There are many possible reverse samplers$^{10}$, and it is even possible to construct reverse samplers which are deterministic. In the remainder of this tutorial we will see three popular reverse samplers more formally: the DDPM sampler discussed above (Section 2.1), the DDIM sampler (Section 3), which is deterministic, and the family of flow-matching models (Section 4), which can be thought of as a generalization of DDIM.$^{11}$

1.3 Discretization

Before we proceed further, we need to be more precise about what we mean by adjacent distributions $p_t, p_{t-1}$ being "close". We want to think of the sequence $p_0, p_1, \ldots, p_T$ as the discretization of some (well-behaved) time-evolving function $p(x, t)$, that starts from the target distribution $p_0$ at time $t=0$ and ends at the noisy distribution $p_T$ at time $t=1$: