
Step-by-Step Diffusion: An Elementary Tutorial
Preetum Nakkiran$^{1}$, Arwen Bradley$^{1}$, Hattie Zhou$^{1,2}$, Madhu Advani$^{1}$
$^{1}$ Apple, $^{2}$ Mila, Université de Montréal

We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.

Contents

1 Fundamentals of Diffusion
1.1 Gaussian Diffusion
1.2 Diffusions in the Abstract
1.3 Discretization
2 Stochastic Sampling: DDPM
2.1 Correctness of DDPM
2.2 Algorithms
2.3 Variance Reduction: Predicting $x_{0}$
2.4 Diffusions as SDEs [Optional]
3 Deterministic Sampling: DDIM
3.1 Case 1: Single Point
3.2 Velocity Fields and Gases
3.3 Case 2: Two Points
3.4 Case 3: Arbitrary Distributions
3.5 The Probability Flow ODE [Optional]
3.6 Discussion: DDPM vs DDIM
3.7 Remarks on Generalization
4 Flow Matching
4.1 Flows
4.2 Pointwise Flows
4.3 Marginal Flows
4.4 A Simple Choice of Pointwise Flow
4.5 Flow Matching
4.6 DDIM as Flow Matching [Optional]
4.7 Additional Remarks and References [Optional]
5 Diffusion in Practice
A Additional Resources
B Omitted Derivations

Preface

There are many existing resources for learning diffusion models. Why did we write another? Our goal was to teach diffusion as simply as possible, with minimal mathematical and machine learning prerequisites, but in enough detail to reason about its correctness. Unlike most tutorials on this subject, we take neither a Variational Auto-Encoder (VAE) nor a Stochastic Differential Equation (SDE) approach. In fact, for the core ideas we will not need any SDEs, Evidence Lower Bounds (ELBOs), Langevin dynamics, or even the notion of a score. The reader need only be familiar with basic probability, calculus, linear algebra, and multivariate Gaussians. The intended audience for this tutorial is technical readers at the level of at least advanced undergraduate or graduate students, who are learning diffusion for the first time and want a mathematical understanding of the subject.
This tutorial has five parts, each relatively self-contained, but covering closely related topics. Section 1 presents the fundamentals of diffusion: the problem we are trying to solve and an overview of the basic approach. Sections 2 and 3 show how to construct a stochastic and deterministic diffusion sampler, respectively, and give intuitive derivations for why these samplers correctly reverse the forward diffusion process. Section 4 covers the closely-related topic of Flow Matching, which can be thought of as a generalization of diffusion that offers additional flexibility (including what are called rectified flows or linear flows). Finally, in Section 5 we return to diffusion and connect this tutorial to the broader literature while highlighting some of the design choices that matter most in practice, including samplers, noise schedules, and parametrizations.

Acknowledgements

We are grateful for helpful feedback and suggestions from many people, in particular: Josh Susskind, Eugene Ndiaye, Dan Busbridge, Sam Power, De Wang, Russ Webb, Sitan Chen, Vimal Thilak, Etai Littwin, Chenyang Yuan, Alex Schwing, Miguel Angel Bautista Martin, and Dilip Krishnan.

1 Fundamentals of Diffusion

The goal of generative modeling is: given i.i.d. samples from some unknown distribution $p^{*}(x)$, construct a sampler for (approximately) the same distribution. For example, given a training set of dog images from some underlying distribution $p_{\text{dog}}$, we want a method of producing new images of dogs from this distribution.
One way to solve this problem, at a high level, is to learn a transformation from some easy-to-sample distribution (such as Gaussian noise) to our target distribution $p^{*}$. Diffusion models offer a general framework for learning such transformations. The clever trick of diffusion is to reduce the problem of sampling from the distribution $p^{*}(x)$ to a sequence of easier sampling problems.
This idea is best explained via the following Gaussian diffusion example. We'll sketch the main ideas now, and in later sections we will use this setup to derive what are commonly known as the DDPM and DDIM samplers$^{1}$, and reason about their correctness.

1.1 Gaussian Diffusion

For Gaussian diffusion, let $x_{0}$ be a random variable in $\mathbb{R}^{d}$ distributed according to the target distribution $p^{*}$ (e.g., images of dogs). Then construct a sequence of random variables $x_{1}, x_{2}, \ldots, x_{T}$ by successively adding independent Gaussian noise with some small scale $\sigma$:
\begin{equation*} x_{t+1}:=x_{t}+\eta_{t}, \quad \eta_{t} \sim \mathcal{N}\left(0, \sigma^{2}\right) . \tag{1} \end{equation*}
This is called the forward process$^{2}$, which transforms the data distribution into a noise distribution. Equation (1) defines a joint distribution over all $\left(x_{0}, x_{1}, \ldots, x_{T}\right)$, and we let $\left\{p_{t}\right\}_{t \in[T]}$ denote the marginal distributions of each $x_{t}$. Notice that at large step count $T$, the distribution $p_{T}$ is nearly Gaussian$^{3}$, so we can approximately sample from $p_{T}$ by just sampling a Gaussian.
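As a rough numerical illustration of the forward process, the sketch below applies Equation (1) repeatedly to samples from a toy one-dimensional target and checks that the result has the expected variance and is close to Gaussian. The two-point target, the values of `T` and `sigma`, and the helper name `forward_process` are illustrative assumptions, not part of the setup above.

```python
import numpy as np

def forward_process(x0, num_steps, sigma, rng):
    """Apply Equation (1) num_steps times: x_{t+1} = x_t + eta_t, eta_t ~ N(0, sigma^2)."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x + sigma * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
# Hypothetical 1-D target p*: two point masses at -1 and +1.
x0 = rng.choice([-1.0, 1.0], size=20_000)
T, sigma = 200, 0.5
xT = forward_process(x0, T, sigma, rng)

# Independent increments add in variance: var(x_T) = var(x_0) + T * sigma^2.
expected_var = x0.var() + T * sigma**2
centered = xT - xT.mean()
# Excess kurtosis is 0 for an exact Gaussian.
excess_kurtosis = np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

print("empirical var(x_T):", xT.var())
print("var(x_0) + T*sigma^2:", expected_var)
print("excess kurtosis of x_T:", excess_kurtosis)
```

Although the target here is far from Gaussian, the accumulated noise dominates at large $T$, which is the sense in which $p_{T}$ is nearly Gaussian.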
Figure 1: Probability distributions defined by the diffusion forward process on a one-dimensional target distribution $p_{0}$.
Now, suppose we can solve the following subproblem:
"Given a sample marginally distributed as $p_{t}$, produce a sample marginally distributed as $p_{t-1}$."
We will call a method that does this a reverse sampler$^{4}$, since it tells us how to sample from $p_{t-1}$ assuming we can already sample from $p_{t}$. If we had a reverse sampler, we could sample from our target $p_{0}$ by simply starting with a Gaussian sample from $p_{T}$, and iteratively applying the reverse sampling procedure to get samples from $p_{T-1}, p_{T-2}, \ldots$ and finally $p_{0}=p^{*}$.
The key insight of diffusion is that learning to reverse each intermediate step can be easier than learning to sample from the target distribution in one step$^{5}$. There are many ways to construct reverse samplers, but for concreteness let us first see the standard diffusion sampler, which we will call the DDPM sampler$^{6}$.
$^{5}$ Intuitively this is because the distributions $\left(p_{t-1}, p_{t}\right)$ are already quite close, so the reverse sampler does not need to do much.
$^{6}$ This is the sampling strategy originally proposed in Sohl-Dickstein et al. [2015].
The ideal DDPM sampler uses the obvious strategy: at time $t$, given input $z$ (which is promised to be a sample from $p_{t}$), we output a sample from the conditional distribution
\begin{equation*} p\left(x_{t-1} \mid x_{t}=z\right) . \tag{2} \end{equation*}
This is clearly a correct reverse sampler. The problem is that it requires learning a generative model for the conditional distribution $p\left(x_{t-1} \mid x_{t}\right)$ for every $x_{t}$, which could be complicated. But if the per-step noise $\sigma$ is sufficiently small, then it turns out this conditional distribution becomes simple:
Fact 1 (Diffusion Reverse Process). For small $\sigma$, and the Gaussian diffusion process defined in (1), the conditional distribution $p\left(x_{t-1} \mid x_{t}\right)$ is itself close to Gaussian. That is, for all times $t$ and conditionings $z \in \mathbb{R}^{d}$, there exists some mean parameter $\mu \in \mathbb{R}^{d}$ such that
\begin{equation*} p\left(x_{t-1} \mid x_{t}=z\right) \approx \mathcal{N}\left(x_{t-1} ; \mu, \sigma^{2}\right) . \tag{3} \end{equation*}
This is not an obvious fact; we will derive it in Section 2.1. This fact enables a drastic simplification: instead of having to learn an arbitrary distribution $p\left(x_{t-1} \mid x_{t}\right)$ from scratch, we now know everything about this distribution except its mean, which we denote$^{7}$ $\mu_{t-1}\left(x_{t}\right)$. The fact that we can approximate the posterior distribution as Gaussian when $\sigma$ is sufficiently small is illustrated in Figure 2. This is an important point, so to reiterate: for a given time $t$ and conditioning value $x_{t}$, learning the mean of $p\left(x_{t-1} \mid x_{t}\right)$ is sufficient to learn the full conditional distribution $p\left(x_{t-1} \mid x_{t}\right)$.
Figure 2: Illustration of Fact 1. The prior distribution $p\left(x_{t-1}\right)$, leftmost, defines a joint distribution $\left(x_{t-1}, x_{t}\right)$ where $p\left(x_{t} \mid x_{t-1}\right)=\mathcal{N}\left(x_{t-1}, \sigma^{2}\right)$. We plot the reverse conditional distributions $p\left(x_{t-1} \mid x_{t}\right)$ for a fixed conditioning $x_{t}$ and varying noise levels $\sigma$. Notice these distributions become close to Gaussian for small $\sigma$.
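The behaviour described in Fact 1 can be probed numerically in one dimension. The sketch below computes $p\left(x_{t-1} \mid x_{t}=z\right)$ on a grid via Bayes' rule and measures how far it is from the Gaussian with the same mean and variance, for several values of $\sigma$. The bimodal prior, the conditioning value `z`, the grid, and the function names are illustrative assumptions; the L1 distance is just one convenient way to quantify the gap.

```python
import numpy as np

def reverse_conditional(prior_pdf, z, sigma, grid):
    """Density of p(x_{t-1} | x_t = z) on `grid`, via Bayes' rule with x_t = x_{t-1} + N(0, sigma^2)."""
    likelihood = np.exp(-0.5 * ((z - grid) / sigma) ** 2)
    unnorm = prior_pdf(grid) * likelihood
    dx = grid[1] - grid[0]
    return unnorm / (unnorm.sum() * dx)

def gaussian_gap(density, grid):
    """L1 distance between `density` and the Gaussian with the same mean and variance."""
    dx = grid[1] - grid[0]
    mean = (grid * density).sum() * dx
    var = ((grid - mean) ** 2 * density).sum() * dx
    gauss = np.exp(-0.5 * (grid - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.abs(density - gauss).sum() * dx

def prior_pdf(x):
    """Hypothetical bimodal prior p(x_{t-1}): equal mixture of N(-1, 0.5^2) and N(+1, 0.5^2)."""
    def normal(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return 0.5 * normal(x, -1.0, 0.5) + 0.5 * normal(x, 1.0, 0.5)

grid = np.linspace(-6.0, 6.0, 24001)
z = 0.3  # fixed conditioning value x_t = z
gaps = {sigma: gaussian_gap(reverse_conditional(prior_pdf, z, sigma, grid), grid)
        for sigma in (2.0, 0.5, 0.1, 0.02)}
for sigma, gap in gaps.items():
    print(f"sigma={sigma:<5} L1 gap to Gaussian = {gap:.4f}")
```

For large $\sigma$ the conditional inherits the two modes of the prior, while for small $\sigma$ the likelihood term is so narrow that the prior is nearly flat across it, and the gap to a Gaussian shrinks.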
Learning the mean of $p\left(x_{t-1} \mid x_{t}\right)$ is a much simpler problem than learning the full conditional distribution, because we can solve it by regression. To elaborate, we have a joint distribution $\left(x_{t-1}, x_{t}\right)$ from which we can easily sample, and we would like to estimate $\mathbb{E}\left[x_{t-1} \mid x_{t}\right]$. This can be done by optimizing a standard regression loss$^{8}$:
\begin{align*} \mu_{t-1}(z) &:=\mathbb{E}\left[x_{t-1} \mid x_{t}=z\right] \tag{4}\\ \Longrightarrow \mu_{t-1} &=\underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}} \underset{x_{t}, x_{t-1}}{\mathbb{E}}\left\|f\left(x_{t}\right)-x_{t-1}\right\|_{2}^{2} \tag{5}\\ &=\underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}} \underset{x_{t-1}, \eta_{t-1}}{\mathbb{E}}\left\|f\left(x_{t-1}+\eta_{t-1}\right)-x_{t-1}\right\|_{2}^{2}, \tag{6} \end{align*}
where the expectation is taken over samples $x_{0}$ from our target distribution $p^{*}$.$^{9}$ This particular regression problem is well-studied in certain settings. For example, when the target $p^{*}$ is a distribution on images, then the corresponding regression problem (Equation 6) is exactly an image denoising objective, which can be approached with familiar methods (e.g. convolutional neural networks).
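To make the regression view concrete, here is a minimal sketch in a case where the answer is known in closed form. We assume (for illustration only) that $p_{t-1}=\mathcal{N}\left(0, s^{2}\right)$ in one dimension, so that $\mathbb{E}\left[x_{t-1} \mid x_{t}=z\right]=\frac{s^{2}}{s^{2}+\sigma^{2}} z$, and we restrict $f$ to linear functions rather than a neural network. The specific values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s_prev, sigma = 200_000, 1.5, 0.4   # illustrative values, not from the text

# Sample the joint (x_{t-1}, x_t): x_t = x_{t-1} + eta, eta ~ N(0, sigma^2).
# Assumption: p_{t-1} = N(0, s_prev^2), so E[x_{t-1} | x_t] is known exactly.
x_prev = s_prev * rng.standard_normal(n)
x_t = x_prev + sigma * rng.standard_normal(n)

# Restrict f to linear functions f(z) = a*z + b and minimize the squared loss of Equation (5).
design = np.stack([x_t, np.ones(n)], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(design, x_prev, rcond=None)

# Closed-form conditional mean for this Gaussian case is a_exact * z.
a_exact = s_prev**2 / (s_prev**2 + sigma**2)
print(f"fitted slope {a_hat:.4f}, exact slope {a_exact:.4f}, fitted intercept {b_hat:.4f}")
```

The least-squares fit only sees samples of the pair $\left(x_{t-1}, x_{t}\right)$, yet it recovers the conditional mean, which is exactly the property the regression objective relies on.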
Stepping back, we have seen something remarkable: we have reduced the problem of learning to sample from an arbitrary distribution to the standard problem of regression.
$^{7}$ We denote the mean as a function $\mu_{t-1}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ because the mean of $p\left(x_{t-1} \mid x_{t}\right)$ depends on the time $t$ as well as the conditioning $x_{t}$, as described in Fact 1.
$^{8}$ Recall the generic fact that for any distribution over $(x, y)$, we have: $\operatorname{argmin}_{f} \mathbb{E}\|f(x)-y\|^{2}=\mathbb{E}[y \mid x]$.
$^{9}$ Notice that we simulate samples of $\left(x_{t-1}, x_{t}\right)$ by adding noise to the samples of $x_{0}$, as defined in Equation 1.

1.2 Diffusions in the Abstract

Let us now abstract away the Gaussian setting, to define diffusion-like models in a way that will capture their many instantiations (including deterministic samplers, discrete domains, and flow-matching).
Abstractly, here is how to construct a diffusion-like generative model: We start with our target distribution $p^{*}$, and we pick some base distribution $q(x)$ which is easy to sample from, e.g. a standard Gaussian or i.i.d. bits. We then try to construct a sequence of distributions which interpolate between our target $p^{*}$ and the base distribution $q$. That is, we construct distributions
\begin{equation*} p_{0}, p_{1}, p_{2}, \ldots, p_{T} \tag{7} \end{equation*}
such that $p_{0}=p^{*}$ is our target, $p_{T}=q$ the base distribution, and adjacent distributions $\left(p_{t-1}, p_{t}\right)$ are marginally "close" in some appropriate sense. Then, we learn a reverse sampler which transforms distributions $p_{t}$ to $p_{t-1}$. This is the key learning step, which presumably is made easier by the fact that adjacent distributions are "close." Formally, reverse samplers are defined below.
Definition 1 (Reverse Sampler). Given a sequence of marginal distributions $p_{t}$, a reverse sampler for step $t$ is a potentially stochastic function $F_{t}$ such that if $x_{t} \sim p_{t}$, then the marginal distribution of $F_{t}\left(x_{t}\right)$ is exactly $p_{t-1}$:
\begin{equation*} \left\{F_{t}(z): z \sim p_{t}\right\} \equiv p_{t-1} . \tag{8} \end{equation*}
There are many possible reverse samplers$^{10}$, and it is even possible to construct reverse samplers which are deterministic. In the remainder of this tutorial we will see three popular reverse samplers more formally: the DDPM sampler discussed above (Section 2.1), the DDIM sampler (Section 3), which is deterministic, and the family of flow-matching models (Section 4), which can be thought of as a generalization of DDIM.$^{11}$
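To see that Definition 1 really does admit more than one reverse sampler, consider a toy case with Gaussian marginals. The sketch below assumes (purely for illustration) that $p_{t-1}=\mathcal{N}(0, v)$ and $p_{t}=\mathcal{N}\left(0, v+\sigma^{2}\right)$ in one dimension, and compares a stochastic sampler (drawing from the exact conditional) with a deterministic one (a rescaling). The function names and parameter values are hypothetical.

```python
import numpy as np

def stochastic_reverse(z, v_prev, sigma, rng):
    """Sample from the exact conditional p(x_{t-1} | x_t = z) for Gaussian marginals."""
    v_t = v_prev + sigma**2
    mean = (v_prev / v_t) * z
    post_var = v_prev * sigma**2 / v_t
    return mean + np.sqrt(post_var) * rng.standard_normal(z.shape)

def deterministic_reverse(z, v_prev, sigma):
    """Rescale z so that N(0, v_prev + sigma^2) is mapped onto N(0, v_prev)."""
    return np.sqrt(v_prev / (v_prev + sigma**2)) * z

rng = np.random.default_rng(0)
v_prev, sigma = 1.0, 0.8           # illustrative values
x_t = np.sqrt(v_prev + sigma**2) * rng.standard_normal(200_000)   # x_t ~ p_t

out_stochastic = stochastic_reverse(x_t, v_prev, sigma, rng)
out_deterministic = deterministic_reverse(x_t, v_prev, sigma)
corr_stochastic = np.corrcoef(x_t, out_stochastic)[0, 1]
corr_deterministic = np.corrcoef(x_t, out_deterministic)[0, 1]

print("target variance of p_{t-1}:", v_prev)
print("stochastic sampler variance:   ", out_stochastic.var())
print("deterministic sampler variance:", out_deterministic.var())
# Same output marginal, different couplings with the input x_t.
print("corr(x_t, output):", corr_stochastic, corr_deterministic)
```

Both functions satisfy Equation (8) in this toy setting; they differ only in the joint distribution of input and output, which is the freedom in choosing a coupling.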

1.3 Discretization

Before we proceed further, we need to be more precise about what we mean by adjacent distributions $p_{t}, p_{t-1}$ being "close". We want to think of the sequence $p_{0}, p_{1}, \ldots, p_{T}$ as the discretization of some (well-behaved) time-evolving function $p(x, t)$ that starts from the target distribution $p_{0}$ at time $t=0$ and ends at the noisy distribution $p_{T}$ at time $t=1$:
\begin{equation*} p(x, k \Delta t)=p_{k}(x), \quad \text { where } \Delta t=\frac{1}{T} . \tag{9} \end{equation*}
The number of steps $T$ controls the fineness of the discretization (hence the closeness of adjacent distributions).$^{12}$
In order to ensure that the variance of the final distribution, $p_{T}$, is independent of the number of discretization steps, we also need to be more specific about the variance of each increment. Note that if $x_{k}=x_{k-1}+\mathcal{N}\left(0, \sigma^{2}\right)$, then $x_{T} \sim \mathcal{N}\left(x_{0}, T \sigma^{2}\right)$. Therefore, we need to scale the variance of each increment by $\Delta t=1 / T$, that is, choose
\begin{equation*} \sigma=\sigma_{q} \sqrt{\Delta t} \tag{10} \end{equation*}
where $\sigma_{q}^{2}$ is the desired terminal variance. This choice ensures that the noise variance of $p_{T}$ (that is, the variance of $x_{T}$ given $x_{0}$) is always $\sigma_{q}^{2}$, regardless of $T$. (The $\sqrt{\Delta t}$ scaling will turn out to be important in our arguments for the correctness of our reverse solvers in the next section, and also connects to the SDE formulation in Section 2.4.)
$^{10}$ Notice that none of this abstraction is specific to the case of Gaussian noise; in fact, it does not even require the concept of "adding noise". It is even possible to instantiate in discrete settings, where we consider distributions $p^{*}$ over a finite set, and define corresponding "interpolating distributions" and reverse samplers.
$^{11}$ Given a set of marginal distributions $\left\{p_{t}\right\}$, there are many possible joint distributions consistent with these marginals (such joint distributions are called couplings). There is therefore no canonical reverse sampler for a given set of marginals $\left\{p_{t}\right\}$; we are free to choose whichever coupling is most convenient.
At this point, it is convenient to adjust our notation. From here on, $t$ will represent a continuous value in the interval $[0,1]$ (specifically, taking one of the values $0, \Delta t, 2 \Delta t, \ldots, T \Delta t=1$). Subscripts will indicate time rather than index, so for example $x_{t}$ will now denote $x$ at a discretized time $t$. That is, Equation 1 becomes:
\begin{equation*} x_{t+\Delta t}:=x_{t}+\eta_{t}, \quad \eta_{t} \sim \mathcal{N}\left(0, \sigma_{q}^{2} \Delta t\right) \tag{11} \end{equation*}
which also implies that
\begin{equation*} x_{t} \sim \mathcal{N}\left(x_{0}, \sigma_{t}^{2}\right), \quad \text { where } \sigma_{t}:=\sigma_{q} \sqrt{t} \tag{12} \end{equation*}
since the total noise added up to time $t$ (i.e. $\sum_{\tau \in\{0, \Delta t, 2 \Delta t, \ldots, t-\Delta t\}} \eta_{\tau}$) is also Gaussian with mean zero and variance $\sum_{\tau} \sigma_{q}^{2} \Delta t=\sigma_{q}^{2} t$.
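The effect of the $\sqrt{\Delta t}$ scaling can be checked with a short simulation: the sketch below sums the increments of Equation (11) over the whole interval for several step counts and compares the empirical variance of $x_{1}-x_{0}$ with $\sigma_{q}^{2}$. The value of `sigma_q`, the step counts, and the helper name `terminal_noise` are illustrative assumptions.

```python
import numpy as np

def terminal_noise(num_steps, sigma_q, num_samples, rng):
    """Sum the increments of Equation (11) from t = 0 to t = 1 with dt = 1 / num_steps."""
    dt = 1.0 / num_steps
    increments = np.sqrt(sigma_q**2 * dt) * rng.standard_normal((num_steps, num_samples))
    return increments.sum(axis=0)    # equals x_1 - x_0

rng = np.random.default_rng(0)
sigma_q = 2.0                        # illustrative terminal noise scale
variances = {T: terminal_noise(T, sigma_q, 10_000, rng).var() for T in (10, 100, 500)}
for T, v in variances.items():
    print(f"T={T:<4} empirical variance of x_1 - x_0 = {v:.3f}   (sigma_q^2 = {sigma_q**2})")
```

Without the rescaling, using a fixed per-step $\sigma$ as in Equation (1), the same sum would have variance $T \sigma^{2}$ and would grow with the number of steps.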

2 Stochastic Sampling: DDPM

In this section we review the DDPM-like reverse sampler discussed in Section 1, and heuristically prove its correctness. This sampler is conceptually the same as the sampler popularized in Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. [2020] and originally introduced by Sohl-Dickstein et al. [2015], when adapted to our simplified setting. However, a word of warning for the reader familiar with Ho et al. [2020]: Although the overall strategy of our sampler is identical to Ho et al. [2020], certain technical details (like constants, etc) are slightly different 13 13 ^(13){ }^{13}13.
We consider the setup from Section 1.3, with some target distribution $p^{*}$ and the joint distribution of noisy samples $\left(x_{0}, x_{\Delta t}, \ldots, x_{1}\right)$ defined by Equation (11). The DDPM sampler will require estimates of the following conditional expectations:
\begin{equation*} \mu_{t}(z):=\mathbb{E}\left[x_{t} \mid x_{t+\Delta t}=z\right] . \tag{13} \end{equation*}
This is a set of functions $\left\{\mu_{t}\right\}$, one for every time step $t \in\{0, \Delta t, \ldots, 1-\Delta t\}$. In the training phase, we estimate these functions from i.i.d. samples of $x_{0}$, by optimizing the denoising regression objective
\begin{equation*} \mu_{t}=\underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}} \underset{x_{t}, x_{t+\Delta t}}{\mathbb{E}}\left\|f\left(x_{t+\Delta t}\right)-x_{t}\right\|_{2}^{2} \tag{14} \end{equation*}
typically with a neural network$^{14}$ parameterizing $f$. Then, in the inference phase, we use the estimated functions in the following reverse sampler.
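To make the regression objective (14) concrete, here is a minimal sketch for one fixed timestep in one dimension. It replaces the neural network with an affine function $f(z)=a z+b$ fitted by ordinary least squares; this is an assumption made for illustration, and it happens to be the exact minimizer only because the hypothetical target $\mathcal{N}(m, s_{0}^{2})$ is Gaussian. All numerical values are illustrative.

```python
import math
import random

rng = random.Random(0)
m, s0 = 2.0, 1.0                      # hypothetical 1-D Gaussian target N(m, s0^2)
sigma_q, dt, t = 3.0, 0.05, 0.5       # illustrative diffusion parameters

# Draw pairs (x_t, x_{t+dt}) from the forward process.
pairs = []
for _ in range(50000):
    x0 = rng.gauss(m, s0)
    xt = x0 + rng.gauss(0.0, sigma_q * math.sqrt(t))
    xt_next = xt + rng.gauss(0.0, sigma_q * math.sqrt(dt))
    pairs.append((xt, xt_next))

# Minimise the mean of (a * x_{t+dt} + b - x_t)^2 over (a, b): ordinary least squares.
n = len(pairs)
mean_in = sum(z for _, z in pairs) / n      # mean of the regression input x_{t+dt}
mean_out = sum(x for x, _ in pairs) / n     # mean of the regression target x_t
cov = sum((z - mean_in) * (x - mean_out) for x, z in pairs) / n
var_in = sum((z - mean_in) ** 2 for _, z in pairs) / n
a_fit = cov / var_in
b_fit = mean_out - a_fit * mean_in

# Closed-form E[x_t | x_{t+dt} = z] = a_true * z + b_true for jointly Gaussian variables.
v_t = s0 ** 2 + sigma_q ** 2 * t
v_next = s0 ** 2 + sigma_q ** 2 * (t + dt)
a_true = v_t / v_next
b_true = (1.0 - a_true) * m

print(f"fitted  a={a_fit:.4f}, b={b_fit:.4f}")
print(f"closed  a={a_true:.4f}, b={b_true:.4f}")
```

The regression only ever sees noisy pairs, yet its minimizer recovers the conditional expectation $\mu_{t}$; this is the property the DDPM sampler relies on.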
Algorithm 1: Stochastic Reverse Sampler (DDPM-like)
For input sample $x_{t}$ and timestep $t$, output:
\begin{equation*} \widehat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}\left(x_{t}\right)+\mathcal{N}\left(0, \sigma_{q}^{2} \Delta t\right) \tag{15} \end{equation*}
To actually generate a sample, we first sample $x_{1}$ as an isotropic Gaussian $x_{1} \sim \mathcal{N}\left(0, \sigma_{q}^{2}\right)$, and then run the iteration of Algorithm 1 down to $t=0$, to produce a generated sample $\widehat{x}_{0}$. (Recall that in our discretized notation (12), $x_{1}$ is the fully-noised terminal distribution, and the iteration takes steps of size $\Delta t$.) Explicit pseudocode for these algorithms is given in Section 2.2.
We want to reason about correctness of this entire procedure: why does iterating Algorithm 1 produce a sample from [approximately] our target distribution $p^{*}$? The key missing piece is that we need to prove some version of Fact 1: that the true conditional $p\left(x_{t-\Delta t} \mid x_{t}\right)$ can be well-approximated by a Gaussian, and that this approximation gets better as we scale $\Delta t \rightarrow 0$.
$^{13}$ For the experts, the main difference is we use the "Variance Exploding" diffusion forward process. We also use a constant noise schedule, and we do not discuss how to parameterize the predictor ("predicting $x_{0}$ vs. $x_{t-1}$ vs. noise $\eta$"). We elaborate on the latter point in Section 2.3.

2.1 Correctness of DDPM

Here is a more precise version of Fact 1, along with a heuristic derivation. This will complete the argument that Algorithm 1 is correct, i.e. that it approximates a valid reverse sampler in the sense of Definition 1.
Claim 1 (Informal). Let $p_{t-\Delta t}(x)$ be an arbitrary, sufficiently-smooth density over $\mathbb{R}^{d}$. Consider the joint distribution of $\left(x_{t-\Delta t}, x_{t}\right)$, where $x_{t-\Delta t} \sim p_{t-\Delta t}$ and $x_{t} \sim x_{t-\Delta t}+\mathcal{N}\left(0, \sigma_{q}^{2} \Delta t\right)$. Then, for sufficiently small $\Delta t$, the following holds. For all conditionings $z \in \mathbb{R}^{d}$, there exists $\mu_{z}$ such that:
\begin{equation*} p\left(x_{t-\Delta t} \mid x_{t}=z\right) \approx \mathcal{N}\left(x_{t-\Delta t} ; \mu_{z}, \sigma_{q}^{2} \Delta t\right) . \tag{16} \end{equation*}
for some constant $\mu_{z}$ depending only on $z$. Moreover, it suffices to take$^{15}$
\begin{align*} \mu_{z} & :=\underset{\left(x_{t-\Delta t}, x_{t}\right)}{\mathbb{E}}\left[x_{t-\Delta t} \mid x_{t}=z\right] \tag{17}\\ & =z+\left(\sigma_{q}^{2} \Delta t\right) \nabla \log p_{t}(z), \tag{18} \end{align*}
where $p_{t}$ is the marginal distribution of $x_{t}$.
Before we see the derivation, a few remarks: Claim 1 implies that to sample from $x_{t-\Delta t}$, it suffices to first sample from $x_{t}$, then sample from a Gaussian distribution centered around $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$. This is exactly what DDPM does, in Equation (15). Finally, in these notes we will not actually need the expression for $\mu_{z}$ in Equation (18); it is enough for us to know that such a $\mu_{z}$ exists, so we can learn it from samples.
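The identity between the conditional mean (17) and the score expression (18) can be checked numerically. The sketch below assumes a hypothetical one-dimensional, non-Gaussian density for $x_{t-\Delta t}$ (an equal mixture of two Gaussians), computes $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}=z\right]$ by brute-force integration of Bayes' rule, and compares it with $z+\sigma_{q}^{2} \Delta t \nabla \log p_{t}(z)$. The mixture, the grid, and the deliberately large value of `step_var` $=\sigma_{q}^{2} \Delta t$ are all illustrative choices.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


# Hypothetical density of x_{t-dt}: equal mixture of N(-1, 0.09) and N(+1, 0.09).
CENTERS, COMP_VAR = (-1.0, 1.0), 0.09


def p_prev(x):
    return sum(0.5 * normal_pdf(x, c, COMP_VAR) for c in CENTERS)


def score_t(z, step_var):
    """grad log p_t(z), where p_t is p_prev convolved with N(0, step_var)."""
    var = COMP_VAR + step_var
    dens = [0.5 * normal_pdf(z, c, var) for c in CENTERS]
    return sum(d * (c - z) / var for d, c in zip(dens, CENTERS)) / sum(dens)


def posterior_mean(z, step_var, lo=-8.0, hi=8.0, n=16001):
    """E[x_{t-dt} | x_t = z] via a Riemann sum over Bayes' rule (Equation 19)."""
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        x = lo + i * h
        w = p_prev(x) * normal_pdf(z, x, step_var)   # prior times forward likelihood
        num += x * w
        den += w
    return num / den


step_var = 0.25   # sigma_q^2 * dt, deliberately not small
checks = [
    (z, posterior_mean(z, step_var), z + step_var * score_t(z, step_var))
    for z in (-1.5, -0.2, 0.0, 0.7, 2.0)
]
for z, lhs, rhs in checks:
    print(f"z={z:+.1f}  E[x_prev | z]={lhs:+.6f}  z + step_var*score={rhs:+.6f}")
```

The two columns agree even though the step is not small, which illustrates that the formula for the mean holds beyond the small-$\Delta t$ regime; it is the Gaussian shape of the conditional, not its mean, that needs $\Delta t$ to be small.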
Proof of Claim 1 (Informal). Here is a heuristic argument for why the score appears in the reverse process. We will essentially just apply Bayes' rule and then Taylor expand appropriately. We start with Bayes' rule:
\begin{equation*} p\left(x_{t-\Delta t} \mid x_{t}\right)=p\left(x_{t} \mid x_{t-\Delta t}\right) p_{t-\Delta t}\left(x_{t-\Delta t}\right) / p_{t}\left(x_{t}\right) \tag{19} \end{equation*}
Then take logs of both sides. Throughout, we will drop any additive constants in the $\log$ (which translate to normalizing factors), and drop all terms of order $\mathcal{O}(\Delta t)$.$^{16}$ Note that we should think of $x_{t}$ as a constant in this derivation, since we want to understand the conditional probability as a function of $x_{t-\Delta t}$. Now:
$$
\begin{aligned}
\log p\left(x_{t-\Delta t} \mid x_{t}\right)
&=\log p\left(x_{t} \mid x_{t-\Delta t}\right)+\log p_{t-\Delta t}\left(x_{t-\Delta t}\right)-\log p_{t}\left(x_{t}\right) \\
&=\log p\left(x_{t} \mid x_{t-\Delta t}\right)+\log p_{t}\left(x_{t-\Delta t}\right)+\mathcal{O}(\Delta t)
&& \text{Drop constants involving only } x_{t}\text{; use } p_{t-\Delta t}(\cdot)=p_{t}(\cdot)-\Delta t \tfrac{\partial}{\partial t} p_{t}(\cdot)+\mathcal{O}(\Delta t^{2}). \\
&=-\frac{1}{2 \sigma_{q}^{2} \Delta t}\left\|x_{t-\Delta t}-x_{t}\right\|_{2}^{2}+\log p_{t}\left(x_{t-\Delta t}\right)
&& \text{Definition of } \log p\left(x_{t} \mid x_{t-\Delta t}\right). \\
&=-\frac{1}{2 \sigma_{q}^{2} \Delta t}\left\|x_{t-\Delta t}-x_{t}\right\|_{2}^{2}
+\underbrace{\log p_{t}\left(x_{t}\right)}_{\text{constant, dropped}}
+\left\langle\nabla_{x} \log p_{t}\left(x_{t}\right),\left(x_{t-\Delta t}-x_{t}\right)\right\rangle+\mathcal{O}(\Delta t)
&& \text{Taylor expand around } x_{t} \text{ and drop constants.} \\
&=-\frac{1}{2 \sigma_{q}^{2} \Delta t}\left(\left\|x_{t-\Delta t}-x_{t}\right\|_{2}^{2}-2 \sigma_{q}^{2} \Delta t\left\langle\nabla_{x} \log p_{t}\left(x_{t}\right),\left(x_{t-\Delta t}-x_{t}\right)\right\rangle\right) \\
&=-\frac{1}{2 \sigma_{q}^{2} \Delta t}\left\|x_{t-\Delta t}-x_{t}-\sigma_{q}^{2} \Delta t \nabla_{x} \log p_{t}\left(x_{t}\right)\right\|_{2}^{2}+C
&& \text{Complete the square in } \left(x_{t-\Delta t}-x_{t}\right) \text{, and drop constant } C \text{ involving only } x_{t}. \\
&=-\frac{1}{2 \sigma_{q}^{2} \Delta t}\left\|x_{t-\Delta t}-\mu\right\|_{2}^{2}
&& \text{For } \mu:=x_{t}+\left(\sigma_{q}^{2} \Delta t\right) \nabla_{x} \log p_{t}\left(x_{t}\right).
\end{aligned}
$$
This is identical, up to additive constants, to the log-density of a Normal distribution with mean $\mu$ and variance $\sigma_{q}^{2} \Delta t$. Therefore,
\begin{equation*} p\left(x_{t-\Delta t} \mid x_{t}\right) \approx \mathcal{N}\left(x_{t-\Delta t} ; \mu, \sigma_{q}^{2} \Delta t\right) \tag{20} \end{equation*}
Reflecting on this derivation, the main idea was that for small enough $\Delta t$, the Bayes-rule expansion of the reverse process $p\left(x_{t-\Delta t} \mid x_{t}\right)$ is dominated by the term $p\left(x_{t} \mid x_{t-\Delta t}\right)$, from the forward process. This is intuitively why the reverse process and the forward process have the same functional form (both are Gaussian here).$^{17}$
$^{15}$ Experts will recognize this mean as related to the score. In fact, Tweedie's formula implies that this mean is exactly correct even for large $\Delta t$, with no approximation required. That is, $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}=z\right]=z+\sigma_{q}^{2} \Delta t \nabla \log p_{t}(z)$. The distribution $p\left(x_{t-\Delta t} \mid x_{t}\right)$ may deviate from Gaussian, however, for larger $\sigma$.
$^{16}$ Note that $x_{t+1}-x_{t} \sim \mathcal{O}(\sqrt{\Delta t})$. Dropping $\mathcal{O}(\Delta t)$ terms means dropping $\left(x_{t+1}-x_{t}\right)^{2} \sim \mathcal{O}(\Delta t)$ in the expansion of $p_{t}\left(x_{t}\right)$, but keeping $\frac{1}{2 \sigma_{q}^{2} \Delta t}\left(x_{t+1}-x_{t}\right)^{2} \sim \mathcal{O}(1)$ in $p\left(x_{t} \mid x_{t+1}\right)$.
$^{17}$ This general relationship between forward and reverse processes holds somewhat more generally than just Gaussian diffusion; see e.g. the discussion in Sohl-Dickstein et al. [2015].
Technical Details [Optional]. The meticulous reader may notice that Claim 1 is not obviously sufficient to imply correctness of the entire DDPM algorithm. The issue is: as we scale down $\Delta t$, the error in our per-step approximation (Equation 16) decreases, but the number of total steps required increases. So if the per-step error does not decrease fast enough (as a function of $\Delta t$), then these errors could accumulate to a non-negligible error by the final step. Thus, we need to quantify how fast the per-step error decays. Lemma 1 below is one way of quantifying this: it states that if the step-size (i.e. variance of the per-step noise) is $\sigma^{2}$, then the KL error of the per-step Gaussian approximation is $\mathcal{O}\left(\sigma^{4}\right)$. This decay rate is fast enough, because the number of steps only grows as $\Omega\left(1 / \sigma^{2}\right)$.$^{18}$
Lemma 1. Let $p(x)$ be an arbitrary density over $\mathbb{R}$, with bounded 1st to 4th order derivatives. Consider the joint distribution $\left(x_{0}, x_{1}\right)$, where $x_{0} \sim p$ and $x_{1} \sim x_{0}+\mathcal{N}\left(0, \sigma^{2}\right)$. Then, for any conditioning $z \in \mathbb{R}$, we have
\begin{equation*} \operatorname{KL}\left(\mathcal{N}\left(\mu_{z}, \sigma^{2}\right) \,\|\, p_{x_{0} \mid x_{1}}\left(\cdot \mid x_{1}=z\right)\right) \leq O\left(\sigma^{4}\right) \tag{21} \end{equation*}
where
\begin{equation*} \mu_{z}:=z+\sigma^{2} \nabla \log p(z) \tag{22} \end{equation*}
Lemma 1 can be proved by what is essentially a careful Taylor expansion; we include the full proof in Appendix B.1.
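As a rough worked version of the error accounting in the Technical Details above (a sketch that simply adds up the per-step errors, and that identifies the per-step variance of Lemma 1 with $\sigma^{2}=\sigma_{q}^{2} \Delta t$ as in Claim 1): there are $1 / \Delta t$ steps, each contributing a KL error of $\mathcal{O}\left(\sigma^{4}\right)$, so the accumulated error is
$$
\underbrace{\frac{1}{\Delta t}}_{\text{number of steps}} \cdot \underbrace{\mathcal{O}\left(\sigma^{4}\right)}_{\text{per-step KL}}
=\frac{1}{\Delta t} \cdot \mathcal{O}\left(\sigma_{q}^{4} \Delta t^{2}\right)
=\mathcal{O}\left(\sigma_{q}^{4} \Delta t\right) \longrightarrow 0 \quad \text{as } \Delta t \rightarrow 0 .
$$
By contrast, if the per-step error decayed only as $\mathcal{O}\left(\sigma^{2}\right)=\mathcal{O}\left(\sigma_{q}^{2} \Delta t\right)$, the same count would give a total of $\mathcal{O}\left(\sigma_{q}^{2}\right)$, which does not vanish as the steps shrink.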

2.2 Algorithms

Pseudocode listings 1 and 2 give the explicit DDPM train loss and sampling code. To train$^{19}$ the network $f_{\theta}$, we must minimize the expected loss $L_{\theta}$ output by Pseudocode 1, typically by backpropagation.
Pseudocode 3 describes the closely-related DDIM sampler, which will be discussed later in Section 3.
$^{19}$ Note that the training procedure optimizes $f_{\theta}$ for all timesteps $t$ simultaneously, by sampling $t \in[0,1]$ uniformly in Line 2.
Pseudocode 2: DDPM sampling (Code for Algorithm 1)
    Input: Trained model $f_{\theta}$.
    Data: Terminal variance $\sigma_{q}^{2}$; step-size $\Delta t$.
    Output: $x_{0}$
    $x_{1} \leftarrow \mathcal{N}\left(0, \sigma_{q}^{2}\right)$
    for $t=1,(1-\Delta t),(1-2 \Delta t), \ldots, \Delta t$ do
        $\eta \leftarrow \mathcal{N}\left(0, \sigma_{q}^{2} \Delta t\right)$
        $x_{t-\Delta t} \leftarrow f_{\theta}\left(x_{t}, t\right)+\eta$
    end
    return $x_{0}$
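A runnable sketch of Pseudocode 2 in one dimension is given below. Since no trained network is available here, it substitutes a closed-form stand-in for $f_{\theta}$: for a hypothetical Gaussian target $\mathcal{N}(M, S_0^{2})$, the conditional mean $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$ follows from Equation (18) with the known score of $p_{t}$. The target, the constants, and the function names are assumptions made for illustration only.

```python
import math
import random


def ddpm_sample(f_theta, sigma_q, n_steps, rng):
    """Pseudocode 2: draw x_1 ~ N(0, sigma_q^2), then step from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, sigma_q)
    for k in range(n_steps, 0, -1):              # t = 1, 1 - dt, ..., dt
        t = k * dt
        eta = rng.gauss(0.0, sigma_q * math.sqrt(dt))
        x = f_theta(x, t) + eta                  # estimate of E[x_{t-dt} | x_t], plus noise
    return x


# Illustrative constants: target N(M, S0^2), with sigma_q much larger than the data scale.
M, S0, SIGMA_Q, N_STEPS = 2.0, 1.0, 10.0, 400
DT = 1.0 / N_STEPS


def exact_mean(x_t, t):
    """Stand-in for a trained f_theta: Equation (18) with the exact Gaussian score."""
    var_t = S0 ** 2 + SIGMA_Q ** 2 * t           # variance of the marginal p_t
    score = -(x_t - M) / var_t                   # grad log p_t(x_t)
    return x_t + SIGMA_Q ** 2 * DT * score


rng = random.Random(0)
samples = [ddpm_sample(exact_mean, SIGMA_Q, N_STEPS, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

print(f"generated mean {sample_mean:.3f} (target {M}), std {sample_std:.3f} (target {S0})")
```

The generated statistics should land close to the target's, though not exactly: the sampler starts from $\mathcal{N}(0, \sigma_{q}^{2})$ rather than the true $p_{1}$, and the per-step Gaussian approximation is only exact in the limit $\Delta t \rightarrow 0$.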

2.3 Variance Reduction: Predicting $x_{0}$

Thus far, our diffusion models have been trained to predict $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$: this is what Algorithm 1 requires, and what the training procedure of Pseudocode 1 produces. However, many practical diffusion implementations actually train to predict $\mathbb{E}\left[x_{0} \mid x_{t}\right]$, i.e. to predict the expectation of the initial point $x_{0}$ instead of the previous point $x_{t-\Delta t}$. This difference turns out to be just a variance reduction trick, which estimates the same quantity in expectation. Formally, the two quantities can be related as follows:
Claim 2. For the Gaussian diffusion setting of Section 1.3, we have:
\begin{equation*} \mathbb{E}\left[\left(x_{t-\Delta t}-x_{t}\right) \mid x_{t}\right]=\frac{\Delta t}{t} \mathbb{E}\left[\left(x_{0}-x_{t}\right) \mid x_{t}\right] \tag{23} \end{equation*}
Or equivalently:
\begin{equation*} \mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]=\left(\frac{\Delta t}{t}\right) \mathbb{E}\left[x_{0} \mid x_{t}\right]+\left(1-\frac{\Delta t}{t}\right) x_{t} \tag{24} \end{equation*}
This claim implies that if we want to estimate $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$, we can instead estimate $\mathbb{E}\left[x_{0} \mid x_{t}\right]$ and then essentially divide by $(t / \Delta t)$, which is the number of steps taken thus far. The variance-reduced versions of the DDPM training and sampling algorithms do exactly this; we include them in Appendix B.9.
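In code, Equation (24) is a one-line interpolation between the current point and the predicted clean point. The sketch below defines a hypothetical helper, `mean_from_x0_prediction`, and checks it against the closed-form conditional means for an assumed Gaussian target $\mathcal{N}(m, s_{0}^{2})$, where both $\mathbb{E}\left[x_{0} \mid x_{t}\right]$ and $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$ follow from standard Gaussian conditioning. The numbers are illustrative.

```python
def mean_from_x0_prediction(x_t, x0_hat, t, dt):
    """Equation (24): convert an estimate of E[x_0 | x_t] into one of E[x_{t-dt} | x_t]."""
    w = dt / t                      # one step out of the t/dt steps taken so far
    return w * x0_hat + (1.0 - w) * x_t


# Illustrative Gaussian target and diffusion parameters.
m, s0, sigma_q = 2.0, 1.0, 3.0
t, dt, x_t = 0.4, 0.05, -1.3

var_t = s0 ** 2 + sigma_q ** 2 * t               # Var(x_t)
var_prev = s0 ** 2 + sigma_q ** 2 * (t - dt)     # Var(x_{t-dt}) = Cov(x_{t-dt}, x_t)

x0_hat = m + (s0 ** 2 / var_t) * (x_t - m)       # E[x_0 | x_t], Gaussian conditioning
direct = m + (var_prev / var_t) * (x_t - m)      # E[x_{t-dt} | x_t], Gaussian conditioning
converted = mean_from_x0_prediction(x_t, x0_hat, t, dt)

print(f"direct {direct:.6f}, via x0-prediction {converted:.6f}")
```

A sampler built on an $x_{0}$-predicting model would call a helper like this at every step, in place of evaluating a model of $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$ directly.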
The intuition behind Claim 2 is illustrated in Figure 3: first, observe that predicting $x_{t-\Delta t}$ given $x_{t}$ is equivalent to predicting the last noise step, which is $\eta_{t-\Delta t}=\left(x_{t}-x_{t-\Delta t}\right)$ in the forward process of Equation (11). But, if we are only given the final $x_{t}$, then all of the previous noise steps $\left\{\eta_{i}\right\}_{i<t}$ intuitively "look the same": we cannot distinguish noise that was added at the last step from noise that was added at the 5th step, for example. By this symmetry, we can conclude that all of the individual noise steps are distributed identically (though not independently) given $x_{t}$. Thus, instead of estimating a single noise step, we can equivalently estimate the average of all prior noise steps, which has much lower variance. There are $(t / \Delta t)$ elapsed noise steps by time $t$, so we divide the total noise by this quantity in Equation (23) to compute the average. See Appendix B.8 for a formal proof.
Word of warning: Diffusion models should always be trained to estimate expectations. In particular, when we train a model to predict $\mathbb{E}\left[x_{0} \mid x_{t}\right]$, we should not think of this as trying to learn "how to sample from the distribution $p\left(x_{0} \mid x_{t}\right)$". For example, if we are training an image diffusion model, then the optimal model will output $\mathbb{E}\left[x_{0} \mid x_{t}\right]$, which will look like a blurry mix of images (e.g. Figure 1b in Karras et al. [2022]); it will not look like an actual image sample. It is good to keep in mind that when diffusion papers colloquially discuss models "predicting $x_{0}$", they do not mean producing something that looks like an actual sample of $x_{0}$.
Figure 3: The intuition behind Claim 2. Given $x_{t}$, the final noise step $\eta_{t-\Delta t}$ is distributed identically as all other noise steps, intuitively because we only know the sum $x_{t}=x_{0}+\sum_{i} \eta_{i}$.

2.4 Diffusions as SDEs [Optional]

In this section$^{20}$, we connect the discrete-time processes we have discussed so far to stochastic differential equations (SDEs). In the continuous limit, as $\Delta t \rightarrow 0$, our discrete diffusion process turns into a stochastic differential equation. SDEs can also represent many other diffusion variants (corresponding to different drift and diffusion terms), offering flexibility in design choices, like scaling and noise-scheduling. The SDE perspective is powerful because existing theory provides a general closed-form solution for the time-reversed SDE. Discretization of the reverse-time SDE for our particular diffusion immediately yields the sampler we derived in this section, but reverse-time SDEs for other diffusion variants are also available automatically (and can then be solved with any off-the-shelf or custom SDE solver), enabling better training and sampling strategies as we will discuss further in Section 5. Though we mention these connections only briefly here, the SDE perspective has had significant impact on the field. For a more detailed discussion, we recommend Yang Song's blog post [Song, 2021].

The Limiting SDE

Recall our discrete update rule:
$$
x_{t+\Delta t}=x_{t}+\sigma_{q} \sqrt{\Delta t}\, \xi, \quad \xi \sim \mathcal{N}(0,1) .
$$
In the limit as $\Delta t \rightarrow 0$, this corresponds to a zero-drift SDE:
\begin{equation*} d x=\sigma_{q}\, d w \tag{25} \end{equation*}
where $w$ is a Brownian motion. A Brownian motion is a stochastic process with i.i.d. Gaussian increments whose variance scales with $\Delta t$.$^{21}$ Very heuristically, we can think of $d w \sim \lim _{\Delta t \rightarrow 0} \sqrt{\Delta t}\, \mathcal{N}(0,1)$, and thus "derive" (25) by
$$
d x=\lim _{\Delta t \rightarrow 0}\left(x_{t+\Delta t}-x_{t}\right)=\sigma_{q} \lim _{\Delta t \rightarrow 0} \sqrt{\Delta t}\, \xi=\sigma_{q}\, d w .
$$
More generally, different variants of diffusion are equivalent to SDEs with different choices of drift and diffusion terms:
\begin{equation*} d x=f(x, t)\, d t+g(t)\, d w \tag{26} \end{equation*}
$^{20}$ Sections marked "[Optional]" are advanced material, and can be skipped on first read. None of the main sections depend on Optional material.
The SDE (25) simply has $f=0$ and $g=\sigma_{q}$. This formulation encompasses many other possibilities, though, corresponding to different choices of $f, g$ in the SDE. As we will revisit in Section 5, this flexibility is important for developing effective algorithms. Two important choices made in practice are tuning the noise schedule and scaling $x_{t}$; together these can help to control the variance of $x_{t}$, and control how much we focus on different noise levels. Adopting a flexible noise schedule $\left\{\sigma_{t}\right\}$ in place of the fixed schedule $\sigma_{t} \equiv \sigma_{q} \sqrt{t}$ corresponds to the SDE [Song et al., 2020]
$$
x_{t} \sim \mathcal{N}\left(x_{0}, \sigma_{t}^{2}\right) \Longleftrightarrow x_{t}=x_{t-\Delta t}+\sqrt{\sigma_{t}^{2}-\sigma_{t-\Delta t}^{2}}\; z_{t-\Delta t} \Longleftrightarrow d x=\sqrt{\frac{d}{d t} \sigma^{2}(t)}\; d w .
$$
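The correspondence between a noise schedule, its per-step noise, and the diffusion coefficient can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical schedule $\sigma(t)=t^{2}$ (chosen only because its derivative is easy to write down) and verifies two things: the per-step variances $\sigma_{t}^{2}-\sigma_{t-\Delta t}^{2}$ telescope to $\sigma_{1}^{2}$, and the per-step variance divided by $\Delta t$ approximates $g(t)^{2}=\frac{d}{d t} \sigma^{2}(t)$.

```python
import math


def sigma(t):
    """Hypothetical noise schedule sigma(t) = t^2, used only for illustration."""
    return t * t


def step_std(t, dt):
    """Std of the noise added between t - dt and t: sqrt(sigma_t^2 - sigma_{t-dt}^2)."""
    return math.sqrt(sigma(t) ** 2 - sigma(t - dt) ** 2)


n_steps = 1000
dt = 1.0 / n_steps

# Independent Gaussian increments add in variance, so the total variance at t = 1
# is the sum of the per-step variances, which telescopes to sigma(1)^2 - sigma(0)^2.
accumulated_var = sum(step_std(k * dt, dt) ** 2 for k in range(1, n_steps + 1))

# Diffusion coefficient: g(t)^2 = d/dt sigma(t)^2 = 4 t^3 for this schedule.
t_mid = 0.5
g_squared_exact = 4.0 * t_mid ** 3
g_squared_discrete = step_std(t_mid, dt) ** 2 / dt

print(f"accumulated variance {accumulated_var:.6f} vs sigma(1)^2 = {sigma(1.0) ** 2:.6f}")
print(f"g^2 at t=0.5: discrete {g_squared_discrete:.4f} vs exact {g_squared_exact:.4f}")
```

The first check holds for any step size, while the second is only approximate at finite $\Delta t$ and tightens as $\Delta t \rightarrow 0$.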
If we also wish to scale each $x_{t}$ by a factor $s(t)$, Karras et al. [2022] show that this corresponds to the SDE$^{22}$
$$
x_{t} \sim \mathcal{N}\left(s(t) x_{0}, s(t)^{2} \sigma(t)^{2}\right) \Longleftrightarrow f(x)=\frac{\dot{s}(t)}{s(t)} x, \quad g(t)=s(t) \sqrt{2 \dot{\sigma}(t) \sigma(t)} .
$$
$^{22}$ As a sketch of how $f$ arises, let's ignore the noise and note that:
$$
\begin{aligned}
x_{t} & =s(t) x_{0} \\
\Longleftrightarrow x_{t+\Delta t} & =\frac{s(t+\Delta t)}{s(t)} x_{t} \\
& =x_{t}+\frac{s(t+\Delta t)-s(t)}{s(t)} x_{t} \\
\Longleftrightarrow d x / d t & =\frac{\dot{s}}{s} x
\end{aligned}
$$

Reverse-Time SDE

The time-reversal of an SDE runs the process backward in time. Reverse-time SDEs are the continuous-time analog of samplers like DDPM. A deep result due to Anderson [1982] (and nicely re-derived in Winkler [2021]) states that the time-reversal of SDE (26) is given by:
\begin{equation*} d x=\left(f(x, t)-g(t)^{2} \nabla_{x} \log p_{t}(x)\right) d t+g(t)\, d \bar{w} \tag{27} \end{equation*}
That is, SDE (27) tells us how to run any SDE of the form (26) backward in time! This means that we don't have to re-derive the reversal in each case, and we can choose any SDE solver to yield a practical sampler. But nothing is free: we still cannot use (27) directly to sample backward, since the term $\nabla_{x} \log p_{t}(x)$ (which is in fact the score that previously appeared in Equation (18)) is unknown in general, since it depends on $p_{t}$. However, if we can learn the score, then we can solve the reverse SDE. This is analogous to discrete diffusion, where the forward process is easy to model (it just adds noise), while the reverse process must be learned.
Let us take a moment to discuss the score, $\nabla_{x} \log p_{t}(x)$, which plays a central role. Intuitively, since the score "points toward higher probability", it helps to reverse the diffusion process, which "flattens out" the probability as it runs forward. The score is also related to the conditional expectation of $x_{0}$ given $x_{t}$. Recall that in the discrete case
$$
\sigma_{q}^{2} \Delta t \nabla \log p_{t}\left(x_{t}\right)=\mathbb{E}\left[x_{t-\Delta t}-x_{t} \mid x_{t}\right]=\frac{\Delta t}{t} \mathbb{E}\left[x_{0}-x_{t} \mid x_{t}\right]
$$
(by Equations (18) and (23)).
Similarly, in the continuous case we have$^{23}$
\begin{equation*} \sigma_{q}^{2} \nabla \log p_{t}\left(x_{t}\right)=\frac{1}{t} \mathbb{E}\left[x_{0}-x_{t} \mid x_{t}\right] \tag{28} \end{equation*}
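Equation (28) can be verified numerically on a small example. The sketch below assumes a hypothetical data distribution with $x_{0}$ uniform on $\{-1,+1\}$, for which Bayes' rule gives $\mathbb{E}\left[x_{0} \mid x_{t}=z\right]=\tanh \left(z / \sigma_{t}^{2}\right)$, and compares $\sigma_{q}^{2} \nabla \log p_{t}(z)$, computed by a central finite difference of the log-density, with $\frac{1}{t}\left(\mathbb{E}\left[x_{0} \mid x_{t}=z\right]-z\right)$. The parameter values are illustrative.

```python
import math


def log_p_t(z, var_t):
    """Log-density, up to an additive constant, of x_t = x_0 + N(0, var_t)
    when x_0 is uniform on {-1, +1}."""
    return math.log(
        math.exp(-0.5 * (z - 1.0) ** 2 / var_t)
        + math.exp(-0.5 * (z + 1.0) ** 2 / var_t)
    )


def expected_x0(z, var_t):
    """E[x_0 | x_t = z] for the two-point prior, from Bayes' rule."""
    return math.tanh(z / var_t)


sigma_q, t = 1.5, 0.3
var_t = sigma_q ** 2 * t          # sigma_t^2 from Equation (12)
h = 1e-5                          # finite-difference step

rows = []
for z in (-2.0, -0.4, 0.1, 1.3):
    score = (log_p_t(z + h, var_t) - log_p_t(z - h, var_t)) / (2.0 * h)
    lhs = sigma_q ** 2 * score                   # left side of Equation (28)
    rhs = (expected_x0(z, var_t) - z) / t        # right side of Equation (28)
    rows.append((z, lhs, rhs))
    print(f"z={z:+.1f}  sigma_q^2*score={lhs:+.6f}  (E[x0|z]-z)/t={rhs:+.6f}")
```

Read in the other direction, the same identity says that a model of $\mathbb{E}\left[x_{0} \mid x_{t}\right]$ determines the score, which is why learning a denoiser is enough to run the reverse process.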
Returning to the reverse SDE, we can show that its discretization yields the DDPM sampler of Claim 1 as a special case. The reversal of the simple SDE (25) is:
\begin{align*}
d x &= -\sigma_{q}^{2} \nabla_{x} \log p_{t}(x) \, d t + \sigma_{q} \, d \bar{w} \tag{29} \\
&= -\frac{1}{t} \mathbb{E}[x_{0}-x_{t} \mid x_{t}] \, d t + \sigma_{q} \, d \bar{w} \tag{30}
\end{align*}
The discretization is
\begin{align*}
x_{t}-x_{t-\Delta t} &= -\frac{\Delta t}{t} \mathbb{E}[x_{0}-x_{t} \mid x_{t}] + \mathcal{N}(0, \sigma_{q}^{2} \Delta t) \tag{31} \\
&= -\mathbb{E}[x_{t-\Delta t}-x_{t} \mid x_{t}] + \mathcal{N}(0, \sigma_{q}^{2} \Delta t) \\
\Longrightarrow \quad x_{t-\Delta t} &= \mathbb{E}[x_{t-\Delta t} \mid x_{t}] + \mathcal{N}(0, \sigma_{q}^{2} \Delta t) \tag{32}
\end{align*}
which is exactly the stochastic (DDPM) sampler derived in Claim 1.
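The discretized update of Equation (32) can be simulated directly when the conditional mean is known exactly. The sketch below assumes a 1-D Gaussian target $p_0 = \mathcal{N}(0, s_0^2)$, so that $\mathbb{E}[x_0 \mid x_t]$ has a closed form; in practice this quantity would come from a trained model. The helper `cond_mean_prev` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup (not from the text): 1-D Gaussian target p_0 = N(0, s0^2),
# so E[x_0 | x_t] is available in closed form instead of from a trained model.
s0, sigma_q = 0.5, 1.0
n_steps, n_samples = 1000, 20000
dt = 1.0 / n_steps

def cond_mean_prev(x_t, t):
    """E[x_{t-dt} | x_t] = x_t + (dt/t) * E[x_0 - x_t | x_t] for this toy target."""
    e_x0 = (s0**2 / (s0**2 + sigma_q**2 * t)) * x_t
    return x_t + (dt / t) * (e_x0 - x_t)

# Start from the exact marginal p_1 = N(0, s0^2 + sigma_q^2).
x = rng.normal(0.0, np.sqrt(s0**2 + sigma_q**2), size=n_samples)

for k in range(n_steps, 0, -1):
    t = k * dt
    noise = rng.normal(0.0, sigma_q * np.sqrt(dt), size=n_samples)
    x = cond_mean_prev(x, t) + noise      # Equation (32)

final_std = x.std()
print("std of samples at t=0:", final_std, "target:", s0)
```

With a small step size the empirical standard deviation at $t = 0$ should land close to $s_0$, up to sampling noise and a discretization error that shrinks with $\Delta t$.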

3 Deterministic Sampling: DDIM

We will now show a deterministic reverse sampler for Gaussian diffusion, which appears similar to the stochastic sampler of the previous section but is conceptually quite different. This sampler is equivalent to the DDIM$^{24}$ update of Song et al. [2021], adapted to our simplified setting.
We consider the same Gaussian diffusion setup as the previous section, with the joint distribution $(x_{0}, x_{\Delta t}, \ldots, x_{1})$ and conditional expectation function $\mu_{t}(z) := \mathbb{E}[x_{t} \mid x_{t+\Delta t}=z]$. The reverse sampler is defined below, and listed explicitly in Pseudocode 3.
Algorithm 2: Deterministic Reverse Sampler (DDIM-like)
For input sample $x_{t}$, and step index $t$, output:
\begin{equation*}
\widehat{x}_{t-\Delta t} \leftarrow x_{t} + \lambda\left(\mu_{t-\Delta t}(x_{t}) - x_{t}\right) \tag{33}
\end{equation*}
where $\lambda := \left(\frac{\sigma_{t}}{\sigma_{t-\Delta t}+\sigma_{t}}\right)$ and $\sigma_{t} \equiv \sigma_{q} \sqrt{t}$ from Equation (12).
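A single step of Algorithm 2 might be written as follows. This is only a sketch: the callable `mu(z, t_prev)` is a hypothetical stand-in for the conditional mean $\mu_{t-\Delta t}$ (in practice a learned model), and `toy_mu` is a placeholder with no meaning beyond exercising the function.

```python
import numpy as np

def sigma(t, sigma_q):
    """Noise scale sigma_t = sigma_q * sqrt(t), as in Equation (12)."""
    return sigma_q * np.sqrt(t)

def ddim_step(x_t, t, dt, mu, sigma_q=1.0):
    """One deterministic reverse step, Equation (33).

    `mu(z, t_prev)` is a hypothetical callable standing in for the conditional
    mean E[x_{t_prev} | x_{t_prev + dt} = z]; in practice it is a learned model.
    """
    t_prev = t - dt
    lam = sigma(t, sigma_q) / (sigma(t_prev, sigma_q) + sigma(t, sigma_q))
    return x_t + lam * (mu(x_t, t_prev) - x_t)

# Usage with a placeholder mu that shrinks its input toward 0 (illustrative only).
toy_mu = lambda z, t_prev: 0.9 * z
x_next = ddim_step(np.array([1.0, -2.0]), t=0.5, dt=0.1, mu=toy_mu)
print(x_next)
```

Note that the step moves only a fraction $\lambda$ of the way toward the conditional mean, and adds no noise.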
How do we show that this defines a valid reverse sampler? Since Algorithm 2 is deterministic, it does not make sense to argue that it samples from $p(x_{t-\Delta t} \mid x_{t})$, as we argued for the DDPM-like stochastic sampler. Instead, we will directly show that Equation (33) implements a valid transport map between the marginal distributions $p_{t}$ and $p_{t-\Delta t}$. That is, if we let $F_{t}$ be the update of Equation (33):
\begin{align*}
F_{t}(z) &:= z + \lambda\left(\mu_{t-\Delta t}(z) - z\right) \tag{34} \\
&= z + \lambda\left(\mathbb{E}[x_{t-\Delta t} \mid x_{t}=z] - z\right) \tag{35}
\end{align*}
then we want to show that$^{25}$
\begin{equation*}
F_{t} \sharp p_{t} \approx p_{t-\Delta t}. \tag{36}
\end{equation*}
Proof overview: The usual way to prove this is to use tools from stochastic calculus, but we'll present an elementary derivation. Our strategy will be to first show that Algorithm 2 is correct in the simplest case of a point-mass distribution, and then lift this result to full distributions by marginalizing appropriately. For the experts, this is similar to "flow-matching" proofs.

3.1 Case 1: Single Point

Let's first understand the simple case where the target distribution $p_{0}$ is a single point mass in $\mathbb{R}^{d}$. Without loss of generality$^{26}$, we can assume the point is at $x_{0}=0$. Is Algorithm 2 correct in this case?
$^{24}$ DDIM stands for Denoising Diffusion Implicit Models, which reflects a perspective used in the original derivation of Song et al. [2021]. Our derivation follows a different perspective, and the "implicit" aspect will not be important to us.
$^{25}$ The notation $F \sharp p$ means the distribution of $\{F(x)\}_{x \sim p}$. This is called the pushforward of $p$ by the function $F$.
To reason about correctness, we want to consider the distributions of $x_{t}$ and $x_{t-\Delta t}$ for an arbitrary step $t$. According to the diffusion forward process (Equation 11), at time $t$ the relevant random variables are$^{27}$
\begin{align*}
x_{0} &= 0 \quad \text{(deterministically)} \\
x_{t-\Delta t} &\sim \mathcal{N}(x_{0}, \sigma_{t-\Delta t}^{2}) \\
x_{t} &\sim \mathcal{N}(x_{t-\Delta t}, \sigma_{t}^{2}-\sigma_{t-\Delta t}^{2})
\end{align*}
The marginal distribution of $x_{t-\Delta t}$ is $p_{t-\Delta t}=\mathcal{N}(0, \sigma_{t-\Delta t}^{2})$, and the marginal distribution of $x_{t}$ is $p_{t}=\mathcal{N}(0, \sigma_{t}^{2})$.
Let us first find some deterministic function $G_{t}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$, such that $G_{t} \sharp p_{t}=p_{t-\Delta t}$. There are many possible functions which will work$^{28}$, but this is the obvious one:
\begin{equation*}
G_{t}(z) := \left(\frac{\sigma_{t-\Delta t}}{\sigma_{t}}\right) z \tag{37}
\end{equation*}
The function $G_{t}$ above simply rescales the Gaussian distribution $p_{t}$ to match the variance of the Gaussian distribution $p_{t-\Delta t}$. It turns out this $G_{t}$ is exactly equivalent to the step $F_{t}$ taken by Algorithm 2, which we will now show.
Claim 3. When the target distribution is a point mass $p_{0}=\delta_{0}$, the update $F_{t}$ (as defined in Equation 35) is equivalent to the scaling $G_{t}$ (as defined in Equation 37):
\begin{equation*}
F_{t} \equiv G_{t} \tag{38}
\end{equation*}
Thus Algorithm 2 defines a reverse sampler for target distribution $p_{0}=\delta_{0}$.
Proof. To apply $F_{t}$, we need to compute $\mathbb{E}[x_{t-\Delta t} \mid x_{t}]$ for our simple distribution. Since $(x_{t-\Delta t}, x_{t})$ are jointly Gaussian, this is$^{29}$
\begin{equation*}
\mathbb{E}[x_{t-\Delta t} \mid x_{t}] = \left(\frac{\sigma_{t-\Delta t}^{2}}{\sigma_{t}^{2}}\right) x_{t} \tag{39}
\end{equation*}
The rest is algebra:
\begin{align*}
F_{t}(x_{t}) &:= x_{t} + \lambda\left(\mathbb{E}[x_{t-\Delta t} \mid x_{t}] - x_{t}\right) && \text{(by definition of } F_{t}\text{)} \\
&= x_{t} + \left(\frac{\sigma_{t}}{\sigma_{t-\Delta t}+\sigma_{t}}\right)\left(\mathbb{E}[x_{t-\Delta t} \mid x_{t}] - x_{t}\right) && \text{(by definition of } \lambda\text{)} \\
&= x_{t} + \left(\frac{\sigma_{t}}{\sigma_{t-\Delta t}+\sigma_{t}}\right)\left(\frac{\sigma_{t-\Delta t}^{2}}{\sigma_{t}^{2}} - 1\right) x_{t} && \text{(by Equation (39))} \\
&= \left(\frac{\sigma_{t-\Delta t}}{\sigma_{t}}\right) x_{t} \\
&= G_{t}(x_{t})
\end{align*}
We therefore conclude that Algorithm 2 is a correct reverse sampler, since it is equivalent to $G_{t}$, and $G_{t}$ is valid.
The correctness of Algorithm 2 still holds$^{30}$ if $x_{0}$ is an arbitrary point instead of $x_{0}=0$, since everything is translationally symmetric.
$^{27}$ We omit the identity matrix in these covariances for notational simplicity. The reader may assume dimension $d=1$ without loss of generality.
$^{28}$ For example, we can always add a rotation around the origin to any valid map.
$^{29}$ Recall the conditional expectation of two jointly Gaussian random variables $(X, Y)$ is $\mathbb{E}[X \mid Y=y]=\mu_{X}+\Sigma_{X Y} \Sigma_{Y Y}^{-1}(y-\mu_{Y})$, where $\mu_{X}, \mu_{Y}$ are the respective means, and $\Sigma_{X Y}, \Sigma_{Y Y}$ the cross-covariance of $(X, Y)$ and covariance of $Y$. Since $X=x_{t-\Delta t}$ and $Y=x_{t}$ are centered at 0, we have $\mu_{X}=\mu_{Y}=0$. For the covariance term, since $x_{t}=x_{t-\Delta t}+\eta$ we have $\Sigma_{X Y}=\mathbb{E}[x_{t} x_{t-\Delta t}^{T}]=\mathbb{E}[x_{t-\Delta t} x_{t-\Delta t}^{T}]=\sigma_{t-\Delta t}^{2} I_{d}$. Similarly, $\Sigma_{Y Y}=\mathbb{E}[x_{t} x_{t}^{T}]=\sigma_{t}^{2} I_{d}$.
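Claim 3 is easy to check numerically. The following sketch, under assumed values of $\sigma_q$, $t$, and $\Delta t$, compares the Algorithm 2 update with the rescaling map for a point mass at the origin and also checks that the pushed-forward samples have the standard deviation of $p_{t-\Delta t}$. The function names `F` and `G` mirror the text; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_q, t, dt = 1.0, 0.6, 0.1             # assumed values, for illustration
s_t, s_prev = sigma_q * np.sqrt(t), sigma_q * np.sqrt(t - dt)

def F(z):
    """Algorithm 2 update for p_0 = delta_0, using Eq. (39) for the conditional mean."""
    lam = s_t / (s_prev + s_t)
    cond_mean = (s_prev**2 / s_t**2) * z
    return z + lam * (cond_mean - z)

def G(z):
    """Rescaling map G_t(z) = (sigma_{t-dt} / sigma_t) z."""
    return (s_prev / s_t) * z

x_t = rng.normal(0.0, s_t, size=100_000)   # samples from p_t = N(0, sigma_t^2)
max_gap = np.max(np.abs(F(x_t) - G(x_t)))
pushed_std = F(x_t).std()
print("max |F - G| =", max_gap)
print("std after the step:", pushed_std, "target sigma_{t-dt}:", s_prev)
```

The first check is exact up to rounding, since $F_t \equiv G_t$ is an algebraic identity here; the second is statistical and holds only up to sampling noise.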

3.2 Velocity Fields and Gases

Before we move on, it will be helpful to think of the DDIM update as equivalent to a velocity field, which moves points at time $t$ to their positions at time $(t-\Delta t)$. Specifically, define the vector field
\begin{equation*}
v_{t}(x_{t}) := \frac{\lambda}{\Delta t}\left(\mathbb{E}[x_{t-\Delta t} \mid x_{t}] - x_{t}\right) \tag{40}
\end{equation*}
Then the DDIM update algorithm of Equation (33) can be written as:
\begin{align*}
\widehat{x}_{t-\Delta t} &:= x_{t} + \lambda\left(\mu_{t-\Delta t}(x_{t}) - x_{t}\right) \\
&= x_{t} + v_{t}(x_{t}) \Delta t. \tag{41}
\end{align*}
The physical intuition for $v_{t}$ is: imagine a gas of non-interacting particles, with density field given by $p_{t}$. Then, suppose a particle at position $z$ moves in the direction $v_{t}(z)$. The resulting gas will have density field $p_{t-\Delta t}$. We write this process as
\begin{equation*}
p_{t} \xrightarrow{v_{t}} p_{t-\Delta t}. \tag{42}
\end{equation*}
In the limit of small stepsize $\Delta t$, speaking informally, we can think of $v_{t}$ as a velocity field, which specifies the instantaneous velocity of particles moving according to the DDIM algorithm.
As a concrete example, if the target distribution is $p_{0}=\delta_{x_{0}}$, as in Section 3.1, then the velocity field of DDIM is $v_{t}(x_{t})=\left(\frac{\sigma_{t}-\sigma_{t-\Delta t}}{\sigma_{t}}\right)(x_{0}-x_{t}) / \Delta t$, which is a vector field that always points towards the initial point $x_{0}$ (see Figure 4).
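The point-mass velocity field can be checked against the update rule directly. In this sketch the location `x0`, the test points, and the parameter values are assumptions for illustration. It verifies that $x_t + v_t(x_t)\Delta t$ reproduces the update of Equation (33), and that $v_t$ is parallel to $x_0 - x_t$.

```python
import numpy as np

sigma_q, t, dt = 1.0, 0.5, 0.01
s_t, s_prev = sigma_q * np.sqrt(t), sigma_q * np.sqrt(t - dt)
x0 = np.array([1.0, -0.5])                 # assumed location of the point mass

def velocity(x_t):
    """v_t(x_t) = ((sigma_t - sigma_{t-dt}) / sigma_t) * (x0 - x_t) / dt."""
    return ((s_t - s_prev) / s_t) * (x0 - x_t) / dt

def ddim_update(x_t):
    """Eq. (33) with E[x_{t-dt} | x_t] = x0 + (sigma_{t-dt}^2 / sigma_t^2) (x_t - x0)."""
    lam = s_t / (s_prev + s_t)
    cond_mean = x0 + (s_prev**2 / s_t**2) * (x_t - x0)
    return x_t + lam * (cond_mean - x_t)

x_t = np.array([[2.0, 1.0], [-1.0, 0.0], [0.3, -2.0]])
via_velocity = x_t + velocity(x_t) * dt    # Equation (41)
via_update = ddim_update(x_t)

# Cosine between v_t(x_t) and the direction (x0 - x_t); 1 means "points at x0".
v, d = velocity(x_t), x0 - x_t
cosines = np.sum(v * d, axis=1) / (np.linalg.norm(v, axis=1) * np.linalg.norm(d, axis=1))
print(np.max(np.abs(via_velocity - via_update)), cosines)
```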

3.3 Case 2: Two Points

Now let us show Algorithm 2 is correct when the target distribution is a mixture of two points:
\begin{equation*}
p_{0} := \frac{1}{2} \delta_{a} + \frac{1}{2} \delta_{b} \tag{43}
\end{equation*}
for some $a, b \in \mathbb{R}^{d}$. According to the diffusion forward process, the distribution at time $t$ will be a mixture of Gaussians$^{31}$:
\begin{equation*}
p_{t} := \frac{1}{2} \mathcal{N}(a, \sigma_{t}^{2}) + \frac{1}{2} \mathcal{N}(b, \sigma_{t}^{2}). \tag{44}
\end{equation*}
We want to show that with these distributions $p_{t}$, the DDIM velocity field $v_{t}$ (of Equation 40) transports $p_{t} \xrightarrow{v_{t}} p_{t-\Delta t}$.
Figure 4: Velocity field $v_{t}$ when $p_{0}=\delta_{x_{0}}$, overlaid on the Gaussian distribution $p_{t}$.
$^{31}$ Linearity of the forward process (with respect to $p_{0}$) was important here. That is, roughly speaking, diffusing a distribution is equivalent to diffusing each individual point in that distribution independently; the points don't interact.
Let us first try to construct some velocity field $v_{t}^{*}$ such that $p_{t} \xrightarrow{v_{t}^{*}} p_{t-\Delta t}$. From our result in Section 3.1 (the fact that the DDIM update works for single points) we already know velocity fields from Equation (33) which transport each mixture component $\{a, b\}$ individually. That is, we know the velocity field $v_{t}^{[a]}$ defined as
\begin{equation*}
v_{t}^{[a]}(x_{t}) := \lambda \underset{x_{0} \sim \delta_{a}}{\mathbb{E}}[x_{t-\Delta t}-x_{t} \mid x_{t}] \tag{45}
\end{equation*}
transports$^{32}$
\begin{equation*}
\mathcal{N}(a, \sigma_{t}^{2}) \xrightarrow{v_{t}^{[a]}} \mathcal{N}(a, \sigma_{t-\Delta t}^{2}) \tag{46}
\end{equation*}
and similarly for $v_{t}^{[b]}$.
We now want some way of combining these two velocity fields into a single velocity $v_{t}^{*}$, which transports the mixture:
\begin{equation*}
\underbrace{\left(\frac{1}{2} \mathcal{N}(a, \sigma_{t}^{2}) + \frac{1}{2} \mathcal{N}(b, \sigma_{t}^{2})\right)}_{p_{t}} \xrightarrow{v_{t}^{*}} \underbrace{\left(\frac{1}{2} \mathcal{N}(a, \sigma_{t-\Delta t}^{2}) + \frac{1}{2} \mathcal{N}(b, \sigma_{t-\Delta t}^{2})\right)}_{p_{t-\Delta t}} \tag{47}
\end{equation*}
We may be tempted to just take the average velocity field ($v_{t}^{*}=0.5 v_{t}^{[a]}+0.5 v_{t}^{[b]}$), but this is incorrect. The correct combined velocity $v_{t}^{*}$ is a weighted average of the individual velocity fields, weighted by their corresponding density fields$^{33}$.
\begin{align*}
v_{t}^{*}(x_{t}) &= \frac{v_{t}^{[a]}(x_{t}) \cdot p(x_{t} \mid x_{0}=a) + v_{t}^{[b]}(x_{t}) \cdot p(x_{t} \mid x_{0}=b)}{p(x_{t} \mid x_{0}=a) + p(x_{t} \mid x_{0}=b)} \tag{48} \\
&= v_{t}^{[a]}(x_{t}) \cdot p(x_{0}=a \mid x_{t}) + v_{t}^{[b]}(x_{t}) \cdot p(x_{0}=b \mid x_{t}) \tag{49}
\end{align*}
Explicitly, the weight for $v_{t}^{[a]}$ at a point $x_{t}$ is the probability that $x_{t}$ was generated from initial point $x_{0}=a$, rather than $x_{0}=b$.
To be intuitively convinced of this$^{34}$, consider the corresponding question about gases illustrated in Figure 5. Suppose we have two overlapping gases, a red gas with density $\mathcal{N}(a, \sigma^{2})$ and velocity $v_{t}^{[a]}$, and a blue gas with density $\mathcal{N}(b, \sigma^{2})$ and velocity $v_{t}^{[b]}$. We want to know: what is the effective velocity of the combined gas (as if we saw only in grayscale)? We should clearly take a weighted average of the individual gas velocities, weighted by their respective densities, just as in Equation (49).
We have now solved the main subproblem of this section: we have found one particular vector field $v_{t}^{*}$ which transports $p_{t}$ to $p_{t-\Delta t}$, for our two-point distribution $p_{0}$. It remains to show that this $v_{t}^{*}$ is equivalent to the velocity field of Algorithm 2 ($v_{t}$ from Equation 40).
$^{32}$ Pay careful attention to which distributions we take expectations over! The expectation in Equation (45) is w.r.t. the single-point distribution $\delta_{a}$, but our definition of the DDIM algorithm, and its vector field in Equation (40), are always w.r.t. the target distribution. In our case, the target distribution is $p_{0}$ of Equation (43).
$^{33}$ Note that we can write the density $\mathcal{N}(x_{t} ; a, \sigma_{t}^{2})$ as $p(x_{t} \mid x_{0}=a)$.
$^{34}$ The time step must be small enough for this analogy to hold, so the DDIM updates are essentially infinitesimal steps. Otherwise, if the step size is large, it may not be possible to combine the two transport maps with "local" (i.e. pointwise) operations alone.
Figure 5: Illustration of combining the velocity fields of two gases. Left: The density and velocity fields of two independent gases (in red and blue). Right: The effective density and velocity field of the combined gas, including streamlines.
To show this, first notice that the individual vector field $v_{t}^{[a]}$ can be written as a conditional expectation. Using the definition in Equation (45)$^{35}$,
\begin{align*}
v_{t}^{[a]}(x_{t}) &= \lambda \underset{x_{0} \sim \delta_{a}}{\mathbb{E}}[x_{t-\Delta t}-x_{t} \mid x_{t}] \tag{50} \\
&= \lambda \underset{x_{0} \sim 1/2 \delta_{a} + 1/2 \delta_{b}}{\mathbb{E}}[x_{t-\Delta t}-x_{t} \mid x_{0}=a, x_{t}] \tag{51}
\end{align*}
Now the entire vector field $v_{t}^{*}$ can be written as a conditional expectation:
\begin{align*}
v_{t}^{*}(x_{t}) &= v_{t}^{[a]}(x_{t}) \cdot p(x_{0}=a \mid x_{t}) + v_{t}^{[b]}(x_{t}) \cdot p(x_{0}=b \mid x_{t}) \tag{52} \\
&= \lambda \mathbb{E}[x_{t-\Delta t}-x_{t} \mid x_{0}=a, x_{t}] \cdot p(x_{0}=a \mid x_{t}) \tag{53} \\
&\quad + \lambda \mathbb{E}[x_{t-\Delta t}-x_{t} \mid x_{0}=b, x_{t}] \cdot p(x_{0}=b \mid x_{t}) \tag{54} \\
&= \lambda \mathbb{E}[x_{t-\Delta t}-x_{t} \mid x_{t}] \tag{55} \\
&= v_{t}(x_{t}) \quad \text{(from Equ...)}
\end{align*}
where all expectations are w.r.t. the distribution $x_{0} \sim 1/2 \delta_{a} + 1/2 \delta_{b}$. Thus, the combined velocity field $v_{t}^{*}$ is exactly the velocity field $v_{t}$ given by the updates of Algorithm 2, so Algorithm 2 is a correct reverse sampler for our two-point mixture distribution.
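The step from Equation (52) to Equation (55) is the law of total expectation, and it can be checked numerically. The sketch below assumes a 1-D mixture with $a=-1$, $b=1$ and follows the normalization of Equation (45), with $\lambda$ and no $1/\Delta t$ factor. The helper names are hypothetical. It computes the posterior-weighted field in two independent ways and also shows that the naive unweighted average differs.

```python
import numpy as np

sigma_q, t, dt = 1.0, 0.4, 0.01
a, b = -1.0, 1.0                            # assumed 1-D mixture locations
s_t = sigma_q * np.sqrt(t)
s_prev = sigma_q * np.sqrt(t - dt)
lam = s_t / (s_prev + s_t)

def single_point_velocity(x_t, c):
    """v_t^{[c]}: lambda * E[x_{t-dt} - x_t | x_t] when p_0 = delta_c (Eq. 45)."""
    cond_mean = c + (s_prev**2 / s_t**2) * (x_t - c)
    return lam * (cond_mean - x_t)

def posterior_a(x_t):
    """p(x_0 = a | x_t) for equal prior weights, written as a stable sigmoid."""
    logit = ((x_t - b) ** 2 - (x_t - a) ** 2) / (2 * s_t**2)
    return 0.5 * (1.0 + np.tanh(0.5 * logit))

x_t = np.linspace(-3.0, 3.0, 13)
w_a = posterior_a(x_t)

# Eq. (49): posterior-weighted combination of the single-point fields.
v_star = w_a * single_point_velocity(x_t, a) + (1 - w_a) * single_point_velocity(x_t, b)

# Eq. (55): lambda * E[x_{t-dt} - x_t | x_t] under the mixture, obtained
# independently from E[x_0 | x_t] and the relation (dt/t) E[x_0 - x_t | x_t].
e_x0 = w_a * a + (1 - w_a) * b
v_mix = lam * (dt / t) * (e_x0 - x_t)

naive = 0.5 * single_point_velocity(x_t, a) + 0.5 * single_point_velocity(x_t, b)
print(np.max(np.abs(v_star - v_mix)), np.max(np.abs(naive - v_mix)))
```

The first printed gap is at the level of rounding error, while the second is not, which matches the claim that the unweighted average is the wrong way to combine the two fields.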

3.4 Case 3: Arbitrary Distributions

Now that we know how to handle two points, we can generalize this idea to arbitrary distributions of $x_{0}$. We will not go into details here, because the general proof will be subsumed by the subsequent section.
It turns out that our overall proof strategy for Algorithm 2 can be generalized significantly to other types of diffusions, without much work. This yields the idea of flow matching, which we will see in the following section. Once we develop the machinery of flows, it is actually straightforward to derive DDIM directly from the simple single-point scaling algorithm of Equation (37): see Appendix B.5.

3.5 The Probability Flow ODE [Optional]

Finally, we generalize our discrete-time deterministic sampler to an ordinary differential equation (ODE) called the probability flow ODE [Song et al., 2020]. The following section builds on our discussion of SDEs as the continuous limit of diffusion in section 2.4. Just as the reverse-time SDEs of section 2.4 offered a flexible continuous-time generalization of discrete stochastic samplers, so we will see that discrete deterministic samplers generalize to ODEs. The ODE formulation offers both a useful theoretical lens through which to view diffusion and practical advantages, such as the opportunity to choose from a variety of off-the-shelf and custom ODE solvers to improve sampling (like the popular DPM++ method, as discussed in chapter 5).
Recall the general SDE (26) from section 2.4:
\begin{equation*}
d x = f(x, t) \, d t + g(t) \, d w
\end{equation*}
Song et al. [2020] showed that it is possible to convert this SDE into a deterministic equivalent called the probability flow ODE (PF-ODE):$^{36}$
\begin{equation*}
\frac{d x}{d t} = \tilde{f}(x, t), \quad \text{where } \tilde{f}(x, t) = f(x, t) - \frac{1}{2} g(t)^{2} \nabla_{x} \log p_{t}(x) \tag{56}
\end{equation*}
SDE (26) and ODE (56) are equivalent in the sense that trajectories obtained by solving the PF-ODE have the same marginal distributions as the SDE trajectories at every point in time$^{37}$. However, note that the score appears here again, as it did in the reverse SDE (27); just as for the reverse SDE, we must learn the score to make the ODE (56) practically useful.
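To make the PF-ODE concrete, here is a minimal sketch that integrates Equation (56) backward in time with explicit Euler steps. It assumes $f = 0$, $g = \sigma_q$, and a 1-D Gaussian target $p_0 = \mathcal{N}(0, s_0^2)$, for which the score is known exactly; with a learned score one would substitute the model for `score`. The function names and parameter values are illustrative.

```python
import numpy as np

# Assumed toy setup: f = 0, g = sigma_q, Gaussian target p_0 = N(0, s0^2),
# so p_t = N(0, s0^2 + sigma_q^2 t) and the score is known exactly.
s0, sigma_q = 1.0, 1.0

def score(x, t):
    return -x / (s0**2 + sigma_q**2 * t)

def pf_ode_drift(x, t):
    """f~(x, t) = f(x, t) - 0.5 * g(t)^2 * score, with f = 0 and g = sigma_q."""
    return -0.5 * sigma_q**2 * score(x, t)

n_steps = 1000
dt = 1.0 / n_steps
x = 1.0                                    # a point at time t = 1
for k in range(n_steps, 0, -1):
    t = k * dt
    x = x - dt * pf_ode_drift(x, t)        # explicit Euler step backward in time

# For Gaussians the exact flow rescales by sqrt(var_0 / var_1).
expected = s0 / np.sqrt(s0**2 + sigma_q**2)
print(x, expected)
```

In this Gaussian case the ODE solution is a pure rescaling, so the Euler result can be compared against the closed form; any off-the-shelf ODE solver could replace the Euler loop.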
Just as DDPM was a (discretized) special case of the reverse-time SDE (27), so DDIM can be seen as a (discretized) special case of the PF-ODE (56). Recall from section 2.4 that the simple diffusion we have been studying corresponds to the SDE (25) with $f=0$ and $g=\sigma_{q}$. The corresponding ODE is
\begin{align*}
\frac{d x}{d t} &= -\frac{1}{2} \sigma_{q}^{2} \nabla_{x} \log p_{t}(x) \tag{57} \\
&= -\frac{1}{2 t} \mathbb{E}[x_{0}-x_{t} \mid x_{t}] \quad \text{(by eq. 28)} \tag{58}
\end{align*}
Reversing and discretizing yields
\begin{align*}
x_{t-\Delta t} &= x_{t} + \frac{\Delta t}{2 t} \mathbb{E}[x_{0}-x_{t} \mid x_{t}] \\
&= x_{t} + \frac{1}{2}\left(\mathbb{E}[x_{t-\Delta t} \mid x_{t}] - x_{t}\right) \quad \text{(by eq. 23)}
\end{align*}
Noting that $\lim_{\Delta t \rightarrow 0}\left(\frac{\sigma_{t}}{\sigma_{t-\Delta t}+\sigma_{t}}\right)=\frac{1}{2}$, we recover the deterministic (DDIM) sampler (33).
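The limit of the DDIM coefficient can be seen numerically. This short sketch evaluates $\lambda = \sigma_t / (\sigma_{t-\Delta t} + \sigma_t)$ at a fixed, arbitrarily chosen $t$ for shrinking step sizes; the helper name `lam` is illustrative.

```python
import numpy as np

def lam(t, dt, sigma_q=1.0):
    """DDIM coefficient sigma_t / (sigma_{t-dt} + sigma_t), with sigma_t = sigma_q sqrt(t)."""
    s_t = sigma_q * np.sqrt(t)
    s_prev = sigma_q * np.sqrt(t - dt)
    return s_t / (s_prev + s_t)

t = 0.5
step_sizes = [1e-1, 1e-2, 1e-3, 1e-4]
lams = [lam(t, dt) for dt in step_sizes]
for dt, value in zip(step_sizes, lams):
    print(f"dt = {dt:g}: lambda = {value:.6f}")
```

The values approach $1/2$ from above as $\Delta t$ shrinks, which is the coefficient appearing in the discretized PF-ODE.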

3.6 Discussion: DDPM vs DDIM

The two reverse samplers defined above (DDPM and DDIM) are conceptually significantly different: one is deterministic, and the other stochastic. To review, these samplers use the following strategies:
  1. DDPM ideally implements a stochastic map F t F t F_(t)F_{t}Ft, such that the output F t ( x t ) F t x t F_(t)(x_(t))F_{t}\left(x_{t}\right)Ft(xt) is, pointwise, a sample from the conditional distribution p ( x t Δ t x t ) p x t Δ t x t p(x_(t-Delta t)∣x_(t))p\left(x_{t-\Delta t} \mid x_{t}\right)p(xtΔtxt).
  2. DDIM ideally implements a deterministic map F t F t F_(t)F_{t}Ft, such that the output F t ( x t ) F t x t F_(t)(x_(t))F_{t}\left(x_{t}\right)Ft(xt) is marginally distributed as p t Δ t p t Δ t p_(t-Delta t)p_{t-\Delta t}ptΔt. That is, F t p t = p t Δ t F t p t = p t Δ t F_(t)♯p_(t)=p_(t-Delta t)F_{t} \sharp p_{t}=p_{t-\Delta t}Ftpt=ptΔt.
Although they both happen to take steps in the same direction 38 38 ^(38){ }^{38}38
38 38 ^(38){ }^{38}38 Steps proportional to ( μ t Δ t ( x t ) x t ) μ t Δ t x t x t (mu_(t-Delta t)(x_(t))-x_(t))\left(\mu_{t-\Delta t}\left(x_{t}\right)-x_{t}\right)(μtΔt(xt)xt). (given the same input x t x t x_(t)x_{t}xt ), the two algorithms end up evolving very differently. To see this, let's consider how each sampler ideally behaves, when started from the same initial point x 1 x 1 x_(1)x_{1}x1 and iterated to completion.
DDPM will ideally produce a sample from $p(x_0 \mid x_1)$. If the forward process mixes sufficiently (i.e. for large $\sigma_q$ in our setup), then the final point $x_1$ will be nearly independent from the initial point. Thus $p(x_0 \mid x_1) \approx p(x_0)$, so the distribution output by the ideal DDPM will not depend at all$^{39}$ on the starting point $x_1$. In contrast, DDIM is deterministic, so it will always produce a fixed value for a given $x_1$, and thus will depend very strongly on $x_1$.
The picture to have in mind is that DDIM defines a deterministic map $\mathbb{R}^d \rightarrow \mathbb{R}^d$, taking samples from a Gaussian distribution to our target distribution. At this level, the DDIM map may sound similar to other generative models; after all, GANs and Normalizing Flows also define maps from Gaussian noise to the true distribution. What is special about the DDIM map is that it is not allowed to be arbitrary: the target distribution $p^*$ exactly determines the ideal DDIM map (which we train models to emulate). This map is "nice"; for example we expect it to be smooth if our target distribution is smooth. GANs, in contrast, are free to learn any arbitrary mapping between noise and images. This feature of diffusion models may make the learning problem easier in some cases (since it is supervised), or harder in other cases (since there may be easier-to-learn maps which other methods could find).
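The contrast between the two samplers can be seen numerically in a case where everything is available in closed form. The following is a minimal sketch, not the tutorial's own code, under these assumptions: a 1-D Gaussian target $p = \mathcal{N}(0, S^2)$, the forward process $x_t = x_0 + \mathcal{N}(0, \sigma_q^2 t)$, the DDIM step in its continuous-limit form $x_{t-\Delta t} = x_t + \frac{\Delta t}{2t}\mathbb{E}[x_0 - x_t \mid x_t]$, and a DDPM step with mean $x_t + \frac{\Delta t}{t}\mathbb{E}[x_0 - x_t \mid x_t]$ and added noise $\mathcal{N}(0, \sigma_q^2 \Delta t)$. The constants `S`, `SIGMA_Q` and all function names are illustrative.

```python
import math
import random

S = 1.0        # std of the assumed Gaussian target p = N(0, S^2)
SIGMA_Q = 3.0  # terminal noise scale sigma_q

def denoise(x_t, t):
    """Closed-form E[x_0 | x_t] for p = N(0, S^2) and x_t = x_0 + N(0, SIGMA_Q^2 * t)."""
    return x_t * S**2 / (S**2 + SIGMA_Q**2 * t)

def ddim(x1, steps=1000):
    """Deterministic sampler: x <- x + (dt / 2t) * E[x_0 - x_t | x_t]."""
    x, dt = x1, 1.0 / steps
    for k in range(steps, 0, -1):
        t = k * dt
        x += dt / (2 * t) * (denoise(x, t) - x)
    return x

def ddpm(x1, rng, steps=200):
    """Stochastic sampler: mean step (dt / t) * E[x_0 - x_t | x_t], plus N(0, SIGMA_Q^2 dt) noise."""
    x, dt = x1, 1.0 / steps
    for k in range(steps, 0, -1):
        t = k * dt
        x += dt / t * (denoise(x, t) - x) + SIGMA_Q * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
x1 = 2.0
ddim_out = [ddim(x1) for _ in range(3)]           # the same value every time
ddpm_out = [ddpm(x1, rng) for _ in range(1000)]   # spread out over p(x_0 | x_1)

ddpm_mean = sum(ddpm_out) / len(ddpm_out)
ddpm_std = math.sqrt(sum((x - ddpm_mean) ** 2 for x in ddpm_out) / len(ddpm_out))
print("DDIM outputs:", ddim_out)
print("DDPM mean and std:", ddpm_mean, ddpm_std)
```

For this Gaussian case the ideal DDIM map is the linear rescaling $x_1 \mapsto x_1 \cdot S / \sqrt{S^2 + \sigma_q^2}$, so its output is tied to $x_1$, whereas the DDPM outputs spread over the posterior $p(x_0 \mid x_1)$, whose mean retains only a factor $S^2/(S^2+\sigma_q^2)$ of $x_1$.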

3.7 Remarks on Generalization

In this tutorial, we have not discussed the learning-theoretic aspects of diffusion models: How do we learn properties of the underlying distribution, given only finite samples and bounded compute? These are fundamental aspects of learning, but are not yet fully understood for diffusion models; it is an active area of research$^{40}$.
To appreciate the subtlety here, suppose we learn a diffusion model using the classic strategy of Empirical Risk Minimization (ERM): we sample a finite train set from the underlying distribution, and optimize all regression functions w.r.t. this empirical distribution. The problem is, we should not perfectly minimize the empirical risk, because this would yield a diffusion model which only reproduces the train samples$^{41}$.
In general, learning the diffusion model must be regularized, implicitly or explicitly, to prevent overfitting and memorization of the training data. When we train deep neural networks for use in diffusion models, this regularization often occurs implicitly: factors such as finite model size and optimization randomness prevent the trained model from perfectly memorizing its train set. We will revisit these factors (as sources of error) in Section 5.
This issue of memorizing training data has been seen "in the wild" in diffusion models trained on small image datasets, and it has been observed that memorization reduces as the training set size increases [Somepalli et al., 2023, Gu et al., 2023]. Additionally, memorization has been noted as a potential security and copyright issue for neural networks, as in Carlini et al. [2023], where the authors found they can recover training data from Stable Diffusion with the right prompts.
Figure 6 demonstrates the effect of training set size, and shows the DDIM trajectories for a diffusion model trained using a 3-layer ReLU network. We see that the diffusion model trained on $N=10$ samples "memorizes" its train set: its trajectories all collapse to one of the train points, instead of producing the underlying spiral distribution. As we add more samples, the model starts to generalize: the trajectories converge to the underlying spiral manifold. The trajectories also start to become more perpendicular to the underlying manifold, suggesting that the low-dimensional structure is being learned. We also note that in the $N=10$ case where the diffusion model fails, it is not at all obvious that a human would be able to identify the "correct" pattern from these samples, so generalization may be too much to expect.
$^{40}$ We recommend the introductions of Chen et al. [2022] and Chen et al. [2024b] for an overview of recent learning-theoretic results. This line of work includes e.g. De Bortoli et al. [2021], De Bortoli [2022], Lee et al. [2023], Chen et al. [2023, 2024a].
$^{41}$ This is not specific to diffusion models: any perfect generative model of the empirical distribution will always output a uniformly random train point, which is far from optimal w.r.t. the true underlying distribution.
Figure 6: The DDIM trajectories (shaded by timestep $t$) for a spiral dataset. We compare the trajectories with 10, 20, and 40 training samples. Note that as we add more training points (moving left to right) the diffusion algorithm begins to learn the underlying spiral and the trajectories look more perpendicular to the underlying manifold. The network used here is a 3-layer ReLU network with 128 neurons per layer.
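The memorization effect of exact ERM can be reproduced without any neural network. The following is a minimal sketch under stated assumptions: a made-up 1-D train set `TRAIN`, the forward process $x_t = x_0 + \mathcal{N}(0, \sigma_q^2 t)$, and the exact minimizer of the empirical regression objective, which is the posterior mean of $x_0$ under the empirical distribution (a softmax-weighted average of the train points). The DDIM step coefficient $1 - \sigma_{t-\Delta t}/\sigma_t$ is what the discrete update with $\lambda = \sigma_t/(\sigma_{t-\Delta t}+\sigma_t)$ reduces to when $\sigma_t = \sigma_q\sqrt{t}$; both are taken from earlier sections and treated as assumptions here.

```python
import math
import random

SIGMA_Q = 3.0
TRAIN = [-2.0, 0.5, 3.0]   # a tiny, made-up 1-D train set

def empirical_denoiser(x_t, t):
    """E[x_0 | x_t] when x_0 is uniform over TRAIN and x_t = x_0 + N(0, SIGMA_Q^2 * t)."""
    logits = [-(x_t - a) ** 2 / (2 * SIGMA_Q**2 * t) for a in TRAIN]
    m = max(logits)                       # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    return sum(wi * a for wi, a in zip(w, TRAIN)) / sum(w)

def ddim(x1, steps=500):
    """DDIM step: x <- x + (1 - sigma_{t-dt} / sigma_t) * (E[x_0 | x_t] - x)."""
    x = x1
    for k in range(steps, 0, -1):
        t, t_prev = k / steps, (k - 1) / steps
        x += (1 - math.sqrt(t_prev / t)) * (empirical_denoiser(x, t) - x)
    return x

rng = random.Random(0)
starts = [rng.gauss(0.0, SIGMA_Q) for _ in range(200)]
ends = [ddim(x1) for x1 in starts]
dist_to_train = [min(abs(x - a) for a in TRAIN) for x in ends]
print("largest distance from an output to its nearest train point:", max(dist_to_train))
```

In this sketch the perfectly fitted model has nothing to generalize with: the trajectories end (up to numerical precision) on the train points themselves, which is the 1-D analogue of the collapse described for the $N=10$ case.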

4 Flow Matching

We now introduce the framework of flow matching [Peluchetti, 2022, Liu et al., 2022b,a, Lipman et al., 2023, Albergo et al., 2023]. Flow matching can be thought of as a generalization of DDIM, which allows for more flexibility in designing generative models, including for example the rectified flows (sometimes called linear flows) used by Stable Diffusion 3 [Liu et al., 2022a, Esser et al., 2024].
We have actually already seen the main ideas behind flow matching, in our analysis of DDIM in Section 3. At a high level, here is how we constructed a generative model in Section 3:
  1. First, we defined how to generate a single point. Specifically, we constructed vector fields $\{v_t^{[a]}\}_t$ which, when applied for all time steps, transported a standard Gaussian distribution to an arbitrary delta distribution $\delta_a$.
  2. Second, we determined how to combine two vector fields into a single effective vector field. This lets us construct a transport from the standard Gaussian to two points (or, more generally, to a distribution over points - our target distribution).
Neither of these steps particularly requires the Gaussian base distribution, or the Gaussian forward process (Equation 1). The second step of combining vector fields remains identical for any two arbitrary vector fields, for example.
So let's drop all the Gaussian assumptions. Instead, we will begin by thinking at a basic level about how to map between any two points $x_0$ and $x_1$. Then, we see what happens when the two points are sampled from arbitrary distributions $p$ (data) and $q$ (base), respectively. We will see that this point of view encompasses DDIM as a special case, but that it is significantly more general.

4.1 Flows

Let us first define the central notion of a flow. A flow is simply a collection of time-indexed vector fields $v = \{v_t\}_{t \in [0,1]}$. We should think of this as the velocity field $v_t$ of a gas at each time $t$, as we did earlier in Section 3.2. Any flow defines a trajectory taking initial points $x_1$ to final points $x_0$, by transporting the initial point along the velocity fields $\{v_t\}$.
Figure 7: Running a flow which generates a spiral distribution (bottom) from an annular distribution (top).
Formally, for flow $v$ and initial point $x_1$, consider the ODE$^{42}$
\begin{equation*} \frac{d x_{t}}{d t}=-v_{t}\left(x_{t}\right) \tag{59} \end{equation*}
with initial condition $x_1$ at time $t=1$. We write
\begin{equation*} x_{t}:=\operatorname{RunFlow}\left(v, x_{1}, t\right) \tag{60} \end{equation*}
to denote the solution to the flow ODE (Equation 59) at time $t$, terminating at final point $x_0$. That is, RunFlow is the result of transporting point $x_1$ along the flow $v$ up to time $t$.
Just as flows define maps between initial and final points, they also define transports between entire distributions, by "pushing forward" points from the source distribution along their trajectories. If $p_1$ is a distribution on initial points$^{43}$, then applying the flow $v$ yields the distribution on final points$^{44}$
\begin{equation*} p_{0}=\left\{\operatorname{RunFlow}\left(v, x_{1}, t=0\right)\right\}_{x_{1} \sim p_{1}} \tag{61} \end{equation*}
We denote this process as $p_1 \stackrel{v}{\hookrightarrow} p_0$, meaning the flow $v$ transports initial distribution $p_1$ to final distribution$^{45}$ $p_0$.
The ultimate goal of flow matching is to somehow learn a flow $v^*$ which transports $q \stackrel{v^*}{\hookrightarrow} p$, where $p$ is the target distribution and $q$ is some easy-to-sample base distribution (such as a Gaussian). If we had this $v^*$, we could generate samples from our target $p$ by first sampling $x_1 \sim q$, then running our flow with initial point $x_1$ and outputting the resulting final point $x_0$. The DDIM algorithm of Section 3 was actually a special case$^{46}$ of this, for a very particular choice of flow $v^*$. Now, how do we construct such flows in general?
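To make RunFlow concrete, here is a minimal sketch of a numerical approximation using the discrete-time iteration $x_{t-\Delta t} \leftarrow x_t + v_t(x_t)\Delta t$, started at $t=1$. The function names, the representation of a flow as a Python callable `v(x, t)`, and the constant-velocity example flow are all illustrative assumptions; points are scalars for simplicity.

```python
def run_flow(v, x1, t, steps=1000):
    """Euler approximation of RunFlow(v, x1, t): integrate dx/dt = -v_t(x) from time 1 down to t."""
    x = x1
    dt = (1.0 - t) / steps
    for i in range(steps):
        s = 1.0 - i * dt          # current time, decreasing from 1 towards t
        x = x + v(x, s) * dt      # x_{s-dt} <- x_s + v_s(x_s) * dt
    return x

def push_forward(v, samples_p1, steps=1000):
    """Transport samples of p_1 to (approximate) samples of p_0 by running each one to t = 0."""
    return [run_flow(v, x1, 0.0, steps) for x1 in samples_p1]

# Hypothetical flow with constant velocity C: every point is shifted by C between t=1 and t=0.
C = 2.0

def const_flow(x, t):
    return C

print(run_flow(const_flow, 0.5, 0.0))          # close to 0.5 + C
print(push_forward(const_flow, [0.0, 1.0]))    # each sample shifted by about C
```

Euler stepping is only one possible discretization; any ODE solver could be substituted inside `run_flow` without changing the interface.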

4.2 Pointwise Flows

Our basic building-block will be a pointwise flow which just transports a single point $x_1$ to a point $x_0$. Intuitively, given an arbitrary path $\{x_t\}_{t \in [0,1]}$ that connects $x_1$ to $x_0$, a pointwise flow describes this trajectory by giving its velocity $v_t(x_t)$ at each point $x_t$ along it (see Figure 8). Formally, a pointwise flow between $x_1$ and $x_0$ is any flow $\{v_t\}_t$ that satisfies Equation 59 with boundary conditions $x_1$ and $x_0$ at times $t=1,0$ respectively. We denote such flows as $v^{[x_1,x_0]}$. Pointwise flows are not unique: there are many different choices of path between $x_0$ and $x_1$.

4.3 Marginal Flows

Suppose that for all pairs of points $(x_1, x_0)$, we can construct an explicit pointwise flow $v^{[x_1,x_0]}$ that transports a source point $x_1$ to target point $x_0$. For example, we could let $x_t$ travel along a straight line from $x_1$ to $x_0$, or along any other explicit path. Recall in our gas analogy, this corresponds to an individual particle that moves between $x_1$ and $x_0$. Now, let us try to set up a collection of individual particles, such that at $t=1$ the particles are distributed according to $q$, and at $t=0$ they are distributed according to $p$. This is actually easy to do: We can pick any coupling$^{47}$ $\Pi_{q,p}$ between $q$ and $p$, and consider particles corresponding to the pointwise flows $\{v^{[x_1,x_0]}\}_{(x_1,x_0) \sim \Pi_{q,p}}$. This gives us a distribution over pointwise flows (i.e. a collection of particle trajectories) with the desired behavior in aggregate.
$^{42}$ The corresponding discrete-time analog is the iteration: $x_{t-\Delta t} \leftarrow x_t + v_t(x_t)\Delta t$, starting at $t=1$ with initial point $x_1$.
$^{43}$ Notational warning: Most of the flow matching literature uses a reversed time convention, so $t=1$ is the target distribution. We let $t=0$ be the target distribution to be consistent with the DDPM convention.
$^{44}$ We could equivalently write this as the pushforward $\operatorname{RunFlow}(v, \cdot, 0) \sharp p_1$.
$^{45}$ In our gas analogy, this means if we start with a gas of particles distributed according to $p_1$, and each particle follows the trajectory defined by $v$, then the final distribution of particles will be $p_0$.
$^{46}$ To connect to diffusion: The continuous-time limit of DDIM (58) is a flow with $v_t(x_t) = \frac{1}{2t}\mathbb{E}[x_0 - x_t \mid x_t]$. The base distribution $p_1$ is Gaussian. DDIM Sampling (algorithm 3) is a discretized method for evaluating RunFlow. DDPM Training (algorithm 2) is a method for learning $v^*$, but it relies on the Gaussian structure and differs somewhat from the flow-matching algorithm we will present in this chapter.
Figure 8: A pointwise flow $v_t^{[x_1,x_0]}$ transporting $x_1$ to $x_0$.
We would like to combine all of these pointwise flows somehow, to get a single flow $v^*$ that implements the same transport between distributions$^{48}$. Our previous discussion$^{49}$ in Section 3 tells us how to do this: to determine the effective velocity $v_t^*(x_t)$, we should take a weighted average of all individual particle velocities $v_t^{[x_1,x_0]}$, weighted by the probability that a particle at $x_t$ was generated by the pointwise flow $v^{[x_1,x_0]}$. The final result is$^{50}$
\begin{equation*} v_{t}^{*}\left(x_{t}\right):=\underset{x_{0}, x_{1} \mid x_{t}}{\mathbb{E}}\left[v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right) \mid x_{t}\right] \tag{64} \end{equation*}
where the expectation is w.r.t. the joint distribution of $(x_1, x_0, x_t)$ induced by sampling $(x_1, x_0) \sim \Pi_{q,p}$ and letting $x_t \leftarrow \operatorname{RunFlow}(v^{[x_1,x_0]}, x_1, t)$.
At this point, we have a "solution" to our generative modeling problem in principle, but some important questions remain to make it useful in practice:
  • Which pointwise flow $v^{[x_1,x_0]}$ and coupling $\Pi_{q,p}$ should we choose?
  • How do we compute the marginal flow $v^*$? We cannot compute it from Equation (64) directly, because this would require sampling from $p(x_0 \mid x_t)$ for a given point $x_t$, which may be complicated in general.
We answer these in the next sections.

4.4 A Simple Choice of Pointwise Flow

We need explicit choices of a pointwise flow, a base distribution $q$, and a coupling $\Pi_{q,p}$. There are many simple choices which would work$^{51}$.
The base distribution $q$ can be essentially any easy-to-sample distribution. Gaussians are a popular choice but certainly not the only one; Figure 7 uses an annular base distribution, for example. As for the coupling $\Pi_{q,p}$ between the base and target distribution, the simplest choice is the independent coupling, i.e. sampling from $p$ and $q$ independently.
$^{47}$ A coupling $\Pi_{q,p}$ between $q$ and $p$ specifies how to jointly sample pairs $(x_1, x_0)$ of source and target points, such that $x_0$ is marginally distributed as $p$, and $x_1$ as $q$. The most basic coupling is the independent coupling, which corresponds to sampling $x_1, x_0$ independently.
$^{48}$ Why would we like this? As we will see later, it simplifies our learning problem: instead of having to learn the distribution of all the individual trajectories, we can instead just learn one velocity field representing their bulk evolution.
$^{49}$ Compare to Equation (49) in Section 3. A formal statement of how to combine flows is given in Appendix B.4.
$^{50}$ An alternate way of viewing this result at a high level is: we start with pointwise flows $v^{[x_1,x_0]}$ which transport delta distributions:
\begin{equation*} \delta_{x_{1}} \stackrel{v^{\left[x_{1}, x_{0}\right]}}{\hookrightarrow} \delta_{x_{0}} \tag{62} \end{equation*}
And then Equation (64) gives us a fancy way of "averaging these flows over $x_1$ and $x_0$", to get a flow $v^*$ transporting
\begin{equation*} q=\underset{x_{1} \sim q}{\mathbb{E}}\left[\delta_{x_{1}}\right] \stackrel{v^{*}}{\hookrightarrow} \underset{x_{0} \sim p}{\mathbb{E}}\left[\delta_{x_{0}}\right]=p . \tag{63} \end{equation*}
Figure 9: A marginal flow with linear pointwise flows, base distribution $q$ uniform over an annulus, and target distribution $p$ equal to a Dirac-delta at $x_0$. (This can also be thought of as the average over $x_1$ of the pointwise linear flows from $x_1 \sim q$ to a fixed $x_0$.) Gray arrows depict the flow field at different times $t$. The leftmost $(t=1)$ plot shows samples from the base distribution $q$. Subsequent plots show these samples transported by the flow at intermediate times $t$. The final $(t=0)$ plot shows all points collapsed to the target $x_0$. This particular $x_0$ happens to be one point on the spiral distribution of Figure 7.
For a pointwise flow, arguably the simplest construction is a linear pointwise flow:
\begin{align*} v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right) & =x_{0}-x_{1} \tag{65}\\ \Longrightarrow \operatorname{RunFlow}\left(v^{\left[x_{1}, x_{0}\right]}, x_{1}, t\right) & =t x_{1}+(1-t) x_{0} \tag{66} \end{align*}
which simply linearly interpolates between $x_1$ and $x_0$ (and corresponds to the choice made in Liu et al. [2022a]). In Figure 9 we visualize a marginal flow composed of linear pointwise flows, the same annular base distribution $q$ of Figure 7, and target distribution equal to a point-mass $(p = \delta_{x_0})^{52}$.
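For a small enough problem, the marginal flow built from linear pointwise flows can be written down exactly and then run as a sampler. The following is a minimal sketch under illustrative assumptions that are not from the text: a 1-D standard Gaussian base $q = \mathcal{N}(0,1)$, the independent coupling, and a made-up two-point target `POINTS` with probabilities `PROBS`. Under these assumptions $x_t \mid x_0 \sim \mathcal{N}((1-t)x_0, t^2)$, which gives the posterior weights, and $x_0 - x_1 = (x_0 - x_t)/t$ along a linear path, so $v_t^*(x_t) = (\mathbb{E}[x_0 \mid x_t] - x_t)/t$.

```python
import math
import random

POINTS = [-1.0, 2.0]     # support of the made-up target p
PROBS = [0.25, 0.75]     # probability of each support point

def posterior_mean_x0(x_t, t):
    """E[x_0 | x_t] for x_t = t*x_1 + (1-t)*x_0, with x_1 ~ N(0,1) independent of x_0 ~ p."""
    logits = [math.log(p) - (x_t - (1 - t) * a) ** 2 / (2 * t * t)
              for a, p in zip(POINTS, PROBS)]
    m = max(logits)                       # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    return sum(wi * a for wi, a in zip(w, POINTS)) / sum(w)

def marginal_velocity(x_t, t):
    """v*_t(x_t) = E[x_0 - x_1 | x_t]; for linear flows x_0 - x_1 = (x_0 - x_t) / t."""
    return (posterior_mean_x0(x_t, t) - x_t) / t

def sample(x1, steps=200):
    """Run the marginal flow from t=1 to t=0 with x_{t-dt} <- x_t + v*_t(x_t) * dt."""
    x, dt = x1, 1.0 / steps
    for k in range(steps, 0, -1):
        x += marginal_velocity(x, k * dt) * dt
    return x

rng = random.Random(0)
samples = [sample(rng.gauss(0.0, 1.0)) for _ in range(2000)]
frac_at_2 = sum(abs(x - 2.0) < 0.1 for x in samples) / len(samples)
frac_at_minus_1 = sum(abs(x + 1.0) < 0.1 for x in samples) / len(samples)
print("fraction near 2:", frac_at_2, "fraction near -1:", frac_at_minus_1)
```

Individual particles here do not follow any single straight pointwise path; only the averaged velocity field is used, yet the transported samples land on the support of $p$ in roughly the target proportions.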

4.5 Flow Matching

$^{52}$ A marginal flow with a point-mass target distribution, or equivalently the average of pointwise flows over the base distribution only, is sometimes called a (one-sided) conditional flow [Lipman et al., 2023].
$^{53}$ This result is analogous to Theorem 2 in Lipman et al. [2023], but ours is for a two-sided flow.
Now, the only remaining problem is that naively evaluating $v^*$ using Equation (64) would require sampling from $p(x_0 \mid x_t)$ for a given $x_t$, which may be complicated in general. Instead, since $v^*$ is a conditional expectation, we can express it as the solution of a regression problem$^{53}$:
\begin{align*} v_{t}^{*}\left(x_{t}\right) & :=\underset{x_{0}, x_{1} \mid x_{t}}{\mathbb{E}}\left[v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right) \mid x_{t}\right] \tag{67}\\ \Longrightarrow v_{t}^{*} & =\underset{f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}}{\operatorname{argmin}} \underset{\left(x_{0}, x_{1}, x_{t}\right)}{\mathbb{E}}\left\|f\left(x_{t}\right)-v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right)\right\|_{2}^{2} \tag{68} \end{align*}
(by using the generic fact that $\operatorname{argmin}_f \mathbb{E}\|f(x)-y\|^2 = \mathbb{E}[y \mid x]$).
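The generic regression fact follows from a standard decomposition of the squared error. As a short worked step (assuming finite second moments), write $g(x) := \mathbb{E}[y \mid x]$; then for any $f$:

```latex
\begin{aligned}
\mathbb{E}\|f(x)-y\|^{2}
&= \mathbb{E}\|f(x)-g(x)\|^{2}
 + 2\,\mathbb{E}\big\langle f(x)-g(x),\; g(x)-y \big\rangle
 + \mathbb{E}\|g(x)-y\|^{2} \\
&= \mathbb{E}\|f(x)-g(x)\|^{2} + \mathbb{E}\|g(x)-y\|^{2}.
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[g(x) - y \mid x] = 0$. The last term does not depend on $f$, so the objective is minimized exactly when $f = g$.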
In words, Equation (68) says that to compute the loss of a model $f_\theta$ for a fixed time $t$, we should:
  1. Sample source and target points $(x_1, x_0)$ from their joint distribution.
  2. Compute the point $x_t$ deterministically, by running$^{54}$ the pointwise flow $v^{[x_1,x_0]}$ starting from $x_1$ up to time $t$.
  3. Evaluate the model's prediction at $x_t$, as $f_\theta(x_t)$. Evaluate the deterministic vector $v_t^{[x_1,x_0]}(x_t)$. Then compute the L2 loss between these two quantities.
$^{54}$ If we chose linear pointwise flows, for example, this would mean $x_t \leftarrow t x_1 + (1-t) x_0$, via Equation (66).
To sample from the trained model (our estimate of $v_t^*$), we first sample a source point $x_1 \sim q$, then transport it along the learnt flow to a target sample $x_0$. Pseudocode listings 4 and 5 give the explicit procedures for training and sampling from flow-based models (including the special case of linear flows for concreteness, matching Algorithm 1 in Liu et al. [2022a]).

Summary

To summarize, here is how to learn a flow-matching generative model for target distribution $p$.
The Ingredients. We first choose:
  1. A source distribution $q$, from which we can efficiently sample (e.g. a standard Gaussian).
  2. A coupling $\Pi_{q,p}$ between $q$ and $p$, which specifies a way to jointly sample a pair of source and target points $(x_1, x_0)$ with marginals $q$ and $p$ respectively. A standard choice is the independent coupling, i.e. sample $x_1 \sim q$ and $x_0 \sim p$ independently.
  3. For all pairs of points $(x_1, x_0)$, an explicit pointwise flow $v^{[x_1,x_0]}$ which transports $x_1$ to $x_0$. We must be able to efficiently compute the vector field $v_t^{[x_1,x_0]}$ at all points.
These ingredients determine, in theory, a marginal vector field $v^*$ which transports $q$ to $p$:
\begin{equation*} v_{t}^{*}\left(x_{t}\right):=\underset{x_{0}, x_{1} \mid x_{t}}{\mathbb{E}}\left[v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right) \mid x_{t}\right] \tag{69} \end{equation*}
where the expectation is w.r.t. the joint distribution:
$$\begin{aligned} \left(x_{1}, x_{0}\right) & \sim \Pi_{q, p} \\ x_{t} & :=\operatorname{RunFlow}\left(v^{\left[x_{1}, x_{0}\right]}, x_{1}, t\right) \end{aligned}$$
Training. Train a neural network $f_\theta$ by backpropagating the stochastic loss function computed by Pseudocode 4. The optimal function for this expected loss is: $f_\theta(x_t, t) = v_t^*(x_t)$.
Sampling. Run Pseudocode 5 to generate a sample $x_0$ from (approximately) the target distribution $p$.

Pseudocode 4: Flow-matching train loss, generic pointwise flow [or linear flow]

Input: Neural network $f_\theta$
Data: Sample-access to coupling $\Pi_{q,p}$; pointwise flows $\{v_t^{[x_1,x_0]}\}$ for all $x_1, x_0$.
Output: Stochastic loss $L$
$(x_1, x_0) \leftarrow \operatorname{Sample}(\Pi_{q,p})$
$t \leftarrow \operatorname{Unif}[0,1]$
$x_t \leftarrow \underbrace{\operatorname{RunFlow}(v^{[x_1,x_0]}, x_1, t)}_{t x_1 + (1-t) x_0}$
$L \leftarrow \Big\| f_\theta(x_t, t) - \underbrace{v_t^{[x_1,x_0]}(x_t)}_{(x_0 - x_1)} \Big\|_2^2$
return $L$
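A direct transcription of this loss for linear pointwise flows might look as follows. This is a minimal 1-D sketch, not a reference implementation: the model `f` is any Python callable `f(x_t, t)`, and the coupling sampler, the point-mass target at 3, and the two example models are hypothetical choices made so that the optimal model is known in closed form.

```python
import random

def linear_flow_point(x1, x0, t):
    """RunFlow for the linear pointwise flow: x_t = t*x1 + (1-t)*x0 (Equation 66)."""
    return t * x1 + (1 - t) * x0

def flow_matching_loss(f, sample_coupling, rng):
    """One stochastic loss sample for linear flows (squared error; points are scalars here)."""
    x1, x0 = sample_coupling(rng)
    t = rng.random()                       # t ~ Unif[0, 1)
    x_t = linear_flow_point(x1, x0, t)
    target = x0 - x1                       # v_t^{[x1,x0]}(x_t), Equation (65)
    return (f(x_t, t) - target) ** 2

def point_mass_coupling(rng):
    """Hypothetical coupling: x1 ~ N(0,1) from the base q, and x0 = 3 always (point-mass p)."""
    return rng.gauss(0.0, 1.0), 3.0

def optimal_f(x_t, t):
    # With a point-mass target at 3, x0 - x1 = (3 - x_t) / t whenever t > 0.
    return (3.0 - x_t) / t if t > 0 else 0.0

def zero_f(x_t, t):
    return 0.0

rng = random.Random(0)
n = 5000
loss_opt = sum(flow_matching_loss(optimal_f, point_mass_coupling, rng) for _ in range(n)) / n
loss_zero = sum(flow_matching_loss(zero_f, point_mass_coupling, rng) for _ in range(n)) / n
print("average loss of the optimal model:", loss_opt)
print("average loss of the zero model:", loss_zero)
```

With a point-mass target the regression target is a deterministic function of $(x_t, t)$, so the optimal model attains (numerically) zero loss; for a general target the minimal loss is positive, because many pairs $(x_1, x_0)$ can pass through the same $x_t$. In practice `f` would be a neural network and the averaged loss would be minimized by stochastic gradient descent.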

4.6 DDIM as Flow Matching [Optional]

The DDIM algorithm of Section 3 can be seen as a special case of flow matching, for a particular choice of pointwise flows and coupling. We describe the exact correspondence here, which will allow us to notice an interesting relation between DDIM and linear flows.
We claim DDIM is equivalent to flow-matching with the following parameters:
  1. Pointwise Flows: Either of the two equivalent pointwise flows:
\begin{equation*} v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right):=\frac{1}{2 t}\left(x_{0}-x_{t}\right) \tag{70} \end{equation*}
or
\begin{equation*} v_{t}^{\left[x_{1}, x_{0}\right]}\left(x_{t}\right):=\frac{1}{2 \sqrt{t}}\left(x_{0}-x_{1}\right) \tag{71} \end{equation*}
which both generate the trajectory$^{55}$:
\begin{equation*} x_{t}=x_{0}+\left(x_{1}-x_{0}\right) \sqrt{t} \tag{72} \end{equation*}
  2. Coupling: The "diffusion coupling", that is, the joint distribution on $(x_0, x_1)$ generated by
\begin{equation*} x_{0} \sim p ; \quad x_{1} \leftarrow x_{0}+\mathcal{N}\left(0, \sigma_{q}^{2}\right) \tag{73} \end{equation*}
This claim is straightforward to prove (see Appendix B.5), but the implication is somewhat surprising: we can recover the DDIM trajectories (which are not straight in general) as a combination of the straight pointwise trajectories in Equation (72). In fact, the DDIM trajectories are exactly equivalent to flow-matching trajectories for linear flows, with a different scaling of time ($\sqrt{t}$ vs. $t$)$^{56}$.
Claim 4 (DDIM as Linear Flow; Informal). The DDIM sampler (Algorithm 3) is equivalent, up to time-reparameterization, to the marginal flow produced by linear pointwise flows (Equation 65) with the diffusion coupling (Equation 73).
A formal statement of this claim$^{57}$ is provided in Appendix B.7.
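The pointwise part of this correspondence is easy to check numerically. The sketch below is illustrative only: it integrates the flow ODE (59), $dx_t/dt = -v_t(x_t)$, with Euler steps for both pointwise flows, taking Equation (70) with the sign that is consistent with that ODE, i.e. $v = \frac{1}{2t}(x_0 - x_t)$. It then compares against the closed-form trajectory (72) and against the linear flow (66) evaluated at time $\sqrt{t}$. The endpoints and the stopping time are arbitrary choices; integration stops at $t > 0$ because both velocities are singular at $t = 0$.

```python
import math

def run_flow(v, x1, t_end, steps=20000):
    """Euler integration of dx/dt = -v_t(x) from t = 1 down to t_end (with t_end > 0)."""
    x, dt = x1, (1.0 - t_end) / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x += v(x, t) * dt
    return x

x1, x0 = 4.0, -1.0   # arbitrary endpoints for illustration

def v70(x, t):       # Equation (70), with the sign that agrees with the ODE (59)
    return (x0 - x) / (2 * t)

def v71(x, t):       # Equation (71)
    return (x0 - x1) / (2 * math.sqrt(t))

def ddim_path(t):    # Equation (72)
    return x0 + (x1 - x0) * math.sqrt(t)

def linear_path(s):  # Equation (66), linear pointwise flow at time s
    return s * x1 + (1 - s) * x0

t = 0.25
a, b = run_flow(v70, x1, t), run_flow(v71, x1, t)
print(a, b, ddim_path(t), linear_path(math.sqrt(t)))
```

All four printed values should agree up to discretization error, which illustrates both that (70) and (71) trace the same path and that this path is the straight line of the linear flow traversed on a $\sqrt{t}$ clock.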

4.7 Additional Remarks and References [Optional]

  • See Figure 11 for a diagram of the different methods described in this tutorial, and their relations.
  • We highly recommend the flow-matching tutorial of Fjelde et al. [2024], which includes helpful visualizations of flows, and uses notation more consistent with the current literature.
  • As a curiosity, note that we never had to define an explicit "forward process" for flow-matching, as we did for Gaussian diffusion. Rather, it was enough to define the appropriate "reverse processes" (via flows).
  • What we called pointwise flows are also called two-sided conditional flows in the literature, and were developed in Albergo and Vanden-Eijnden [2022], Pooladian et al. [2023], Liu et al. [2022a], Tong et al. [2023].
  • Albergo et al. [2023] define the framework of stochastic interpolants, which can be thought of as considering stochastic pointwise flows, instead of only deterministic ones. Their framework strictly generalizes both DDPM and DDIM.
  • See Stark et al. [2024] for an interesting example of non-standard flows. They derive a generative model for discrete spaces by embedding into a continuous space (the probability simplex), then constructing a special flow on these simplices.
$^{56}$ DDIM at time $t$ corresponds to the linear flow at time $\sqrt{t}$; thus linear flows are "slower" than DDIM when $t$ is small. This may be beneficial for linear flows in practice (speculatively).
Figure 10: The trajectories of individual samples $x_1 \sim q$ for the flow in Figure 7.

5 Diffusion in Practice

To conclude, we mention some aspects of diffusion which are important in practice, but were not covered in this tutorial.
Samplers in Practice. Our DDPM and DDIM samplers (Algorithms 2 and 3) correspond to the samplers presented in Ho et al. [2020] and Song et al. [2021], respectively, but with a different choice of schedule and parametrization (see footnote 13). DDPM and DDIM were some of the earliest samplers to be used in practice, but since then there has been significant progress in samplers for fewer-step generation (which is crucial since each step requires a typically-expensive model forward-pass).$^{58}$ In Sections 2.4 and 3.5, we showed that DDPM and DDIM can be seen as discretizations of the reverse SDE and Probability Flow ODE, respectively. The SDE and ODE perspectives automatically lead to many samplers corresponding to different black-box SDE and ODE numerical solvers (such as Euler, Heun, and Runge-Kutta). It is also possible to take advantage of the specific structure of the diffusion ODE, to improve upon black-box solvers [Lu et al., 2022a,b, Zhang and Chen, 2023].
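To make the "black-box solver" viewpoint concrete, here is a minimal sketch of fixed-step Euler and Heun integrators for a generic ODE $dx/dt = v(x, t)$. The velocity field `toy_velocity` is a hypothetical stand-in for a learned model (it is not a diffusion model), chosen only because its exact solution is known. As in sampling, we integrate backward in time, from $t=1$ to $t=0$.

```python
import math

def euler_solve(v, x, t0, t1, n_steps):
    """Integrate dx/dt = v(x, t) from t0 to t1 with n_steps Euler steps."""
    h = (t1 - t0) / n_steps  # negative when integrating backward in time
    t = t0
    for _ in range(n_steps):
        x = x + h * v(x, t)
        t += h
    return x

def heun_solve(v, x, t0, t1, n_steps):
    """Heun's method: an Euler predictor followed by a trapezoidal corrector."""
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        k1 = v(x, t)
        k2 = v(x + h * k1, t + h)  # second velocity evaluation per step
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x

def toy_velocity(x, t):
    """Hypothetical velocity field dx/dt = -x, with exact solution x(t) = x(1) * exp(1 - t)."""
    return -x

exact = math.e  # exact value of x(0) when x(1) = 1
euler_err = abs(euler_solve(toy_velocity, 1.0, 1.0, 0.0, 10) - exact)
heun_err = abs(heun_solve(toy_velocity, 1.0, 1.0, 0.0, 10) - exact)
print(f"Euler error: {euler_err:.4f}   Heun error: {heun_err:.4f}")
```

Heun uses two velocity evaluations per step, so in a diffusion sampler the comparison that matters is error at a fixed number of model forward-passes rather than at a fixed number of steps.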
Noise Schedules. The noise schedule typically refers to $\sigma_t$, which determines the amount of noise added at time $t$ of the diffusion process. The simple diffusion (1) has $p(x_t) \sim \mathcal{N}(x_0, \sigma_t^2)$ with $\sigma_t \propto \sqrt{t}$. Notice that the variance of $x_t$ increases at every timestep.$^{59}$
In practice, schedules with controlled variance are often preferred. One of the most popular schedules, introduced in Ho et al. [2020], uses a time-dependent variance and scaling such that the variance of $x_t$ remains bounded. Their discrete update is
\begin{equation*} x_{t}=\sqrt{1-\beta(t)}\, x_{t-1}+\sqrt{\beta(t)}\, \varepsilon_{t} ; \quad \varepsilon_{t} \sim \mathcal{N}(0,1) \tag{74} \end{equation*}
where $0<\beta(t)<1$ is chosen so that $x_t$ is (very close to) clean data at $t=1$ and pure noise at $t=T$.
The general SDE (26) introduced in Section 2.4 offers additional flexibility. Our simple diffusion (1) has $f=0, g=\sigma_q$, while the diffusion (74) of Ho et al. [2020] has $f=-\frac{1}{2} \beta(t), g=\sqrt{\beta(t)}$. Karras et al. [2022] reparametrize the SDE in terms of an overall scaling $s(t)$ and variance $\sigma(t)$ of $x_t$, as a more interpretable way to think about diffusion designs, and suggest a schedule with $s(t)=1, \sigma(t)=t$ (which corresponds to $f=0, g=\sqrt{2t}$). Generally, the choice of $f, g$, or equivalently $s, \sigma$, offers a convenient way to explore the design-space of possible schedules.
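The difference between the two schedules is easy to see by tracking only the variance of $x_t$. The sketch below is illustrative: it assumes the simple diffusion adds independent noise of variance $\sigma_q^2 \Delta t$ per step (consistent with $\sigma_t \propto \sqrt{t}$), and it assumes a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over 1000 steps for update (74). The function names and the numerical values are our own choices.

```python
def simple_diffusion_variances(data_var, sigma_q, n_steps):
    """Var[x_t] for the simple diffusion: each step adds variance sigma_q^2 * dt."""
    dt = 1.0 / n_steps
    var, out = data_var, [data_var]
    for _ in range(n_steps):
        var = var + sigma_q ** 2 * dt
        out.append(var)
    return out

def scaled_diffusion_variances(data_var, betas):
    """Var[x_t] under update (74): Var_t = (1 - beta_t) * Var_{t-1} + beta_t."""
    var, out = data_var, [data_var]
    for beta in betas:
        var = (1.0 - beta) * var + beta
        out.append(var)
    return out

def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta values (an assumed schedule, used here for illustration)."""
    return [beta_start + (beta_end - beta_start) * i / (n_steps - 1) for i in range(n_steps)]

growing = simple_diffusion_variances(data_var=1.0, sigma_q=10.0, n_steps=100)
bounded = scaled_diffusion_variances(data_var=4.0, betas=linear_betas(1000))
print(f"simple diffusion: final variance = {growing[-1]:.3f}")
print(f"update (74):      final variance = {bounded[-1]:.5f}, max = {max(bounded):.3f}")
```

Under update (74) the variance relaxes toward 1 regardless of the data variance, whereas in the simple diffusion it grows without any such limit.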


$^{58}$ Even the best samplers still require around 10 sampling steps, which may be impractical. A variety of time distillation methods seek to train one-step-generator student models to match the output of diffusion teacher models, with the goal of high-quality sampling in one (or few) steps. Some examples include consistency models [Song et al., 2023b] and adversarial distillation methods [Lin et al., 2024, Xu et al., 2023, Sauer et al., 2024]. Note, however, that the distilled models are no longer diffusion models, nor are their samplers (even if multi-step) diffusion samplers.

$^{59}$ Song et al. [2020] made the distinction between "variance-exploding" (VE) and "variance-preserving" (VP) schedules while comparing SMLD [Song and Ermon, 2019] and DDPM [Ho et al., 2020]. The terms VE and VP often refer specifically to SMLD and DDPM, respectively. Our diffusion (1) could also be called a variance-exploding schedule, though our noise schedule differs from the one originally proposed in Song and Ermon [2019].
Likelihood Interpretations and VAEs. One popular and useful interpretation of diffusion models is the Variational Auto Encoder (VAE) perspective$^{60}$. Briefly, diffusion models can be viewed as a special case of a deep hierarchical VAE, where each diffusion timestep corresponds to one "layer" of the VAE decoder. The corresponding VAE encoder is given by the forward diffusion process, which produces the sequence of noisy $\{x_t\}$ as the "latents" for input $x$. Notably, the VAE encoder here is not learnt, unlike usual VAEs. Because of the Markovian structure of the latents, each layer of the VAE decoder can be trained in isolation, without forward/backward passing through all previous layers; this helps with the notorious training instability of deep VAEs. We recommend the tutorials of Turner [2021] and Luo [2022] for more details on the VAE perspective.
One advantage of the VAE interpretation is that it gives us an estimate of the data likelihood under our generative model, by using the standard Evidence Lower Bound (ELBO) for VAEs. This allows us to train diffusion models directly using a maximum-likelihood objective. It turns out that the ELBO for the diffusion VAE reduces to exactly the L2 regression loss that we presented, but with a particular time-weighting that weights the regression loss differently at different time-steps $t$. For example, regression errors at large times $t$ (i.e. at high noise levels) may need to be weighted differently from errors at small times, in order for the overall loss to properly reflect a likelihood.$^{61}$ The best choice of time-weighting in practice, however, is still up for debate: the "principled" choice informed by the VAE interpretation does not always produce the best generated samples$^{62}$. See Kingma and Gao [2023] for a good discussion of different weightings and their effect.
$^{60}$ This was actually the original approach to derive the diffusion objective function, in Sohl-Dickstein et al. [2015] and also Ho et al. [2020].
$^{61}$ See also Equation (5) in Kadkhodaie et al. [2024] for a simple bound on KL divergence between the true distribution and generated distribution, in terms of regression excess risks.
$^{62}$ For example, Ho et al. [2020] drops the time-weighting terms, and just uniformly weights all timesteps.
Parametrization: $x_0$ / $\varepsilon$ / $v$-prediction. Another important practical choice is which of several closely-related quantities - partially-denoised data, fully-denoised data, or the noise itself - we ask the network to predict.$^{63}$ Recall that in DDPM Training (Algorithm 1), we asked the network $f_\theta$ to learn to predict $\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]$ by minimizing $\left\|f_{\theta}\left(x_{t}, t\right)-x_{t-\Delta t}\right\|_{2}^{2}$. However, other parametrizations are possible. For example, recalling that $\mathbb{E}\left[x_{t-\Delta t}-x_{t} \mid x_{t}\right] \stackrel{\text{eq. } 23}{=} \frac{\Delta t}{t} \mathbb{E}\left[x_{0}-x_{t} \mid x_{t}\right]$, we see that
\begin{equation*} \min_{\theta}\left\|f_{\theta}\left(x_{t}, t\right)-x_{0}\right\|_{2}^{2} \Longrightarrow f_{\theta}^{\star}\left(x_{t}, t\right)=\mathbb{E}\left[x_{0} \mid x_{t}\right] \end{equation*}
is a (nearly) equivalent problem, which is often called $x_0$-prediction.$^{64}$ The objectives differ only by a time-weighting factor of $\frac{1}{t}$. Similarly, defining the noise $\varepsilon_{t}=\frac{1}{\sigma_{t}} \mathbb{E}\left[x_{0}-x_{t} \mid x_{t}\right]$, we see that we could alternatively ask the network to predict $\mathbb{E}\left[\varepsilon_{t} \mid x_{t}\right]$: this is usually called $\varepsilon$-prediction. Another parametrization, $v$-prediction, asks the model to predict $v=\alpha_{t} \varepsilon-\sigma_{t} x_{0}$ [Salimans and Ho, 2022] - mostly predicting data for high noise-levels and mostly noise for low noise-levels. All the parametrizations differ only by time-weightings (see Appendix B.10 for more details).
$^{63}$ More accurately, the network always predicts conditional expectations of these quantities.
$^{64}$ This corresponds to the variance-reduced algorithm (6).
Although the different time-weightings do not affect the optimal solution, they do impact training as discussed above. Furthermore, even if the time-weightings are adjusted to yield equivalent problems in principle, the different parametrizations may behave differently in practice, since learning is not perfect and certain objectives may be more robust to error. For example, $x_0$-prediction combined with a schedule that places a lot of weight on low noise levels may not work well in practice, since for low noise the identity function can achieve a relatively low objective value, but clearly is not what we want.
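Because the parametrizations are linearly related, a prediction in one form can be converted to another at sampling time. The sketch below illustrates this for $v$-prediction under an assumption that is not spelled out in this section: the noisy sample is written as $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with $\alpha_t^2 + \sigma_t^2 = 1$. The helper names are hypothetical.

```python
import math

def v_target(x0, eps, alpha, sigma):
    """v = alpha * eps - sigma * x0, the v-prediction target."""
    return alpha * eps - sigma * x0

def x0_from_v(x_t, v, alpha, sigma):
    """Recover x0 from (x_t, v), assuming x_t = alpha*x0 + sigma*eps and alpha^2 + sigma^2 = 1."""
    return alpha * x_t - sigma * v

def eps_from_v(x_t, v, alpha, sigma):
    """Recover eps from (x_t, v) under the same assumptions."""
    return sigma * x_t + alpha * v

# Round-trip check at an arbitrary noise level.
alpha, sigma = math.cos(0.3), math.sin(0.3)  # satisfies alpha^2 + sigma^2 = 1
x0, eps = 1.5, -0.7
x_t = alpha * x0 + sigma * eps
v = v_target(x0, eps, alpha, sigma)
print(x0_from_v(x_t, v, alpha, sigma), eps_from_v(x_t, v, alpha, sigma))
```

Since the conversions are exact, the parametrizations share the same optimum; they differ only in how training errors at each noise level are weighted.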
Sources of Error. Finally, when using diffusion and flow models in practice, there are a number of sources of error which prevent the learnt generative model from exactly producing the target distribution. These can be roughly segregated into training-time and sampling-time errors.
  1. Train-time error: Regression errors in learning the population-optimal regression function. The regression objective is the marginal flow $v_t^*$ in flow-matching, or the scores $\mathbb{E}\left[x_{0} \mid x_{t}\right]$ in diffusion models. For each fixed time $t$, this is a standard kind of statistical error. It depends on the neural network architecture and size as well as the number of samples, and can be decomposed further into approximation and estimation errors in the usual way (e.g. see Advani et al. [2020, Sec. 4] decomposing a 2-layer network into approximation error and over-fitting error).
  2. Sampling-time error: Discretization errors from using finite step-sizes $\Delta t$. This error is exactly the discretization error of the ODE or SDE solver used in sampling. These errors manifest in different ways: for DDPM, this reflects the error in using a Gaussian approximation of the reverse process (i.e. Fact 1 breaks for large $\sigma$). For DDIM and flow matching, it reflects the error in simulating continuous-time flows in discrete time.
These errors interact and compound in nontrivial ways, which are not yet fully understood. For example, it is not clear exactly how train-time error in the regression estimates translates into distributional error of the entire generative model. (And this question itself is complicated, since it is not always clear what type of distributional divergence we care about in practice). Interestingly, these "errors" can also have a beneficial effect on small train sets, because they act as a kind of regularization which prevents the diffusion model from just memorizing the train samples (as discussed in Section 3.7).

Conclusion

We have now covered the basics of diffusion models and flow matching. This is an active area of research, and there are many interesting aspects and open questions which we did not cover (see Page 36 for recommended reading). We hope the foundations here equip the reader to understand more advanced topics in diffusion modeling, and perhaps contribute to the research themselves.
Figure 11: Commutative diagram of the different reverse samplers described in this tutorial, and their relations. Each deterministic sampler produces the same marginal distributions as its stochastic counterpart. There are also various ways to construct stochastic versions of flows, which are not pictured here (e.g. Albergo et al. [2023]).

A Additional Resources

Several other helpful resources for learning diffusion (tutorials, blogs, papers), roughly in order of mathematical background required.
  1. Perspectives on diffusion.
Dieleman [2023]. (Webpage.)
Overview of many interpretations of diffusion, and techniques.
  2. Tutorial on Diffusion Models for Imaging and Vision.
Chan [2024]. (49 pgs.)
More focus on intuitions and applications.
  3. Interpreting and improving diffusion models using the euclidean distance function.
Permenter and Yuan [2023]. (Webpage.)
Distance-field interpretation. See accompanying blog with simple code [Yuan, 2024].
  4. On the Mathematics of Diffusion Models.
McAllester [2023]. (4 pgs.)
Short and accessible.
  5. Building Diffusion Model's theory from ground up.
Das [2024]. (Webpage.)
ICLR 2024 Blogposts Track. Focus on SDE and score-matching perspective.
  6. Denoising Diffusion Models: A Generative Learning Big Bang.
Song, Meng, and Vahdat [2023a]. (Video, 3 hrs.)
CVPR 2023 tutorial, with recording.
  7. Diffusion Models From Scratch.
Duan [2023]. (Webpage, 10 parts.)
Fairly complete on topics, includes: DDPM, DDIM, Karras et al. [2022], SDE/ODE solvers. Includes practical remarks and code.
  8. Understanding Diffusion Models: A Unified Perspective.
Luo [2022]. (22 pgs.)
Focus on VAE interpretation, with explicit math details.
  9. Demystifying Variational Diffusion Models.
Ribeiro and Glocker [2024]. (44 pgs.)
Focus on VAE interpretation, with explicit math details.
  10. Diffusion and Score-Based Generative Models.
Song [2023]. (Video, 1.5 hrs.)
Discusses several interpretations, applications, and comparisons to other generative modeling methods.
  11. Deep Unsupervised Learning using Nonequilibrium Thermodynamics.
Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli [2015]. (9 pgs + Appendix.)
Original paper introducing diffusion models for ML. Includes unified description of discrete diffusion (i.e. diffusion on discrete state spaces).
  12. An Introduction to Flow Matching.
Fjelde, Mathieu, and Dutordoir [2024]. (Webpage.)
Insightful figures and animations, with rigorous mathematical exposition.
  13. Elucidating the Design Space of Diffusion-Based Generative Models.
Karras, Aittala, Aila, and Laine [2022]. (10 pgs + Appendix.)
Discusses the effect of various design choices such as noise schedule, parameterization, ODE solver, etc. Presents a generalized framework that captures many choices.
  14. Denoising Diffusion Models.
Peyré [2023]. (4 pgs.)
Fast-track through the mathematics, for readers already comfortable with Langevin dynamics and SDEs.
  15. Generative Modeling by Estimating Gradients of the Data Distribution.
Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole [2020]. (9 pgs + Appendix.)
Presents the connections between SDEs, ODEs, DDIM, and DDPM.
  16. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions.
Albergo, Boffi, and Vanden-Eijnden [2023]. (46 pgs + Appendix.)
Presents a general framework that captures many diffusion variants, and learning objectives. For readers comfortable with SDEs.
  17. Sampling, Diffusions, and Stochastic Localization.
Montanari [2023]. (22 pgs + Appendix.)
Presents diffusion as a special case of "stochastic localization," a technique used in high-dimensional statistics to establish mixing of Markov chains.

B Omitted Derivations

B.1 KL Error in Gaussian Approximation of Reverse Process

Here we prove Lemma 1, restated below.
Lemma 2. Let $p(x)$ be an arbitrary density over $\mathbb{R}$, with bounded 1st to 4th order derivatives. Consider the joint distribution $(x_0, x_1)$, where $x_0 \sim p$ and $x_1 \sim x_0 + \mathcal{N}(0, \sigma^2)$. Then, for any conditioning $z \in \mathbb{R}$ we have
\begin{equation*} \operatorname{KL}\left(\mathcal{N}\left(\mu_{z}, \sigma^{2}\right) \| p_{x_{0} \mid x_{1}}\left(\cdot \mid x_{1}=z\right)\right) \leq O\left(\sigma^{4}\right) \end{equation*}
where
\begin{equation*} \mu_{z}:=z+\sigma^{2} \nabla \log p(z) \tag{76} \end{equation*}
Proof. WLOG, we can take $z=0$. We want to estimate the KL:
\begin{equation*} \operatorname{KL}\left(\mathcal{N}\left(\mu, \sigma^{2}\right) \| p\left(x_{0}=\cdot \mid x_{1}=0\right)\right) \tag{77} \end{equation*}
where we will let $\mu$ be arbitrary for now.
Let $q:=\mathcal{N}\left(\mu, \sigma^{2}\right)$, and $p(x)=: \exp (F(x))$. We have $x_{1} \sim p \star \mathcal{N}\left(0, \sigma^{2}\right)$. This implies:
\begin{equation*} p\left(x_{1}=x\right)=\underset{\eta \sim \mathcal{N}\left(0, \sigma^{2}\right)}{\mathbb{E}}[p(x+\eta)] \tag{78} \end{equation*}
Let us first expand the logs of the two distributions we are comparing:
\begin{align*}
& \log p\left(x_{0}=x \mid x_{1}=0\right) \tag{79}\\
& =\log p\left(x_{1}=0 \mid x_{0}=x\right)+\log p\left(x_{0}=x\right)-\log p\left(x_{1}=0\right) \tag{80}\\
& =-\log (\sigma \sqrt{2 \pi})-0.5 x^{2} \sigma^{-2}+\log p\left(x_{0}=x\right)-\log p\left(x_{1}=0\right) \tag{81}\\
& =-\log (\sigma \sqrt{2 \pi})-0.5 x^{2} \sigma^{-2}+F(x)-\log p\left(x_{1}=0\right) \tag{82}
\end{align*}
And also:
\begin{equation*} \log q(x)=-\log (\sigma \sqrt{2 \pi})-0.5(x-\mu)^{2} \sigma^{-2} \tag{84} \end{equation*}
Now we can expand the KL:
\begin{align*}
& \operatorname{KL}\left(q \| p\left(x_{0}=\cdot \mid x_{1}=0\right)\right) \tag{85}\\
& =\underset{x \sim q}{\mathbb{E}}\left[\log q(x)-\log p\left(x_{0}=x \mid x_{1}=0\right)\right] \tag{86}\\
& =\underset{x \sim q}{\mathbb{E}}\left[-\log (\sigma \sqrt{2 \pi})-0.5(x-\mu)^{2} \sigma^{-2}-\left(-\log (\sigma \sqrt{2 \pi})-0.5 x^{2} \sigma^{-2}+F(x)-\log p\left(x_{1}=0\right)\right)\right] \tag{87}\\
& =\underset{x \sim q}{\mathbb{E}}\left[-0.5(x-\mu)^{2} \sigma^{-2}+0.5 x^{2} \sigma^{-2}-F(x)+\log p\left(x_{1}=0\right)\right] \tag{88}\\
& =\underset{\eta \sim \mathcal{N}\left(0, \sigma^{2}\right) ;\, x=\mu+\eta}{\mathbb{E}}\left[-0.5 \eta^{2} \sigma^{-2}+0.5 x^{2} \sigma^{-2}-F(x)+\log p\left(x_{1}=0\right)\right] \\
& =-0.5\, \mathbb{E}\left[\eta^{2}\right] \sigma^{-2}+0.5\, \mathbb{E}\left[x^{2}\right] \sigma^{-2}-\mathbb{E}[F(x)]+\log p\left(x_{1}=0\right) \tag{89}\\
& =-0.5 \sigma^{2} \sigma^{-2}+0.5\left(\sigma^{2}+\mu^{2}\right) \sigma^{-2}-\mathbb{E}[F(x)]+\log p\left(x_{1}=0\right) \tag{90}\\
& =0.5 \mu^{2} \sigma^{-2}+\log p\left(x_{1}=0\right)-\underset{x \sim q}{\mathbb{E}}[F(x)] \tag{91}\\
& \approx 0.5 \mu^{2} \sigma^{-2}+\log p\left(x_{1}=0\right)-\underset{x \sim q}{\mathbb{E}}\left[F(0)+F^{\prime}(0) x+0.5 F^{\prime \prime}(0) x^{2}+O\left(x^{3}\right)+O\left(x^{4}\right)\right] \tag{92}\\
& =\log p\left(x_{1}=0\right)+0.5 \mu^{2} \sigma^{-2}-F(0)-F^{\prime}(0) \mu-0.5 F^{\prime \prime}(0)\left(\mu^{2}+\sigma^{2}\right)+O\left(\sigma^{2} \mu+\mu^{2}+\sigma^{4}\right) \tag{93}
\end{align*}
We will now estimate the first term, $\log p\left(x_{1}=0\right)$:
\begin{align*}
& \log p\left(x_{1}=0\right) \tag{94}\\
& =\log \underset{\eta \sim \mathcal{N}\left(0, \sigma^{2}\right)}{\mathbb{E}}[p(\eta)] \tag{95}\\
& =\log \underset{\eta \sim \mathcal{N}\left(0, \sigma^{2}\right)}{\mathbb{E}}\left[p(0)+p^{\prime}(0) \eta+0.5 p^{\prime \prime}(0) \eta^{2}+O\left(\eta^{3}\right)+O\left(\eta^{4}\right)\right] \tag{96}\\
& =\log \left(p(0)+0.5 p^{\prime \prime}(0) \sigma^{2}+O\left(\sigma^{4}\right)\right) \tag{97}\\
& =\log p(0)+\frac{0.5 p^{\prime \prime}(0) \sigma^{2}+O\left(\sigma^{4}\right)}{p(0)}+O\left(\sigma^{4}\right) \\
& =\log p(0)+0.5 \sigma^{2} \frac{p^{\prime \prime}(0)}{p(0)}+O\left(\sigma^{4}\right) \tag{98}
\end{align*}
To compute the derivatives of $p$, observe that:
\begin{align*}
F(x) & =\log p(x) \tag{99}\\
\Longrightarrow F^{\prime}(x) & =p^{\prime}(x) / p(x) \tag{100}\\
\Longrightarrow F^{\prime \prime}(x) & =p^{\prime \prime}(x) / p(x)-\left(p^{\prime}(x) / p(x)\right)^{2} \tag{101}\\
& =p^{\prime \prime}(x) / p(x)-\left(F^{\prime}(x)\right)^{2} \tag{102}\\
\Longrightarrow p^{\prime \prime}(x) / p(x) & =F^{\prime \prime}(x)+\left(F^{\prime}(x)\right)^{2} \tag{103}
\end{align*}
Thus, continuing from line (98):
\begin{align*}
\log p\left(x_{1}=0\right) & =\log p(0)+0.5 \sigma^{2} \frac{p^{\prime \prime}(0)}{p(0)}+O\left(\sigma^{4}\right) \tag{104}\\
& =F(0)+0.5 \sigma^{2}\left(F^{\prime \prime}(0)+F^{\prime}(0)^{2}\right)+O\left(\sigma^{4}\right)
\end{align*}
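As a sanity check on this expansion, consider the case where everything is available in closed form. The sketch below assumes $p = \mathcal{N}(m, 1)$ (our choice, for illustration), so that $x_1 \sim \mathcal{N}(m, 1+\sigma^2)$ exactly, and compares the exact value of $\log p(x_1 = 0)$ against the estimate $F(0) + 0.5\sigma^2\left(F''(0) + F'(0)^2\right)$ obtained by combining lines (98) and (103).

```python
import math

def exact_log_density_at_zero(m, sigma):
    """log p(x_1 = 0) when p = N(m, 1), so that x_1 ~ N(m, 1 + sigma^2)."""
    var = 1.0 + sigma ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - m ** 2 / (2.0 * var)

def second_order_estimate(m, sigma):
    """F(0) + 0.5 * sigma^2 * (F''(0) + F'(0)^2) for F(x) = log N(x; m, 1)."""
    f0 = -0.5 * math.log(2.0 * math.pi) - 0.5 * m ** 2  # F(0)
    df0 = m                                             # F'(0) = -(0 - m)
    d2f0 = -1.0                                         # F''(0)
    return f0 + 0.5 * sigma ** 2 * (d2f0 + df0 ** 2)

m = 1.0
for sigma in [0.2, 0.1, 0.05]:
    err = abs(exact_log_density_at_zero(m, sigma) - second_order_estimate(m, sigma))
    # Halving sigma should shrink the error by roughly 2^4 = 16.
    print(f"sigma={sigma:.2f}  error={err:.3e}")
```

The error shrinking by a factor of about 16 per halving of $\sigma$ is what an $O(\sigma^4)$ remainder predicts.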
We can now plug this estimate of $\log p\left(x_{1}=0\right)$ into Line (93). We omit the argument $(0)$ from $F$ for simplicity:
\begin{align*}
& \operatorname{KL}\left(q \| p\left(x_{0}=\cdot \mid x_{1}=0\right)\right) \tag{105}\\
& =\log p\left(x_{1}=0\right)+0.5 \mu^{2} \sigma^{-2}-F-F^{\prime} \mu-0.5 F^{\prime \prime}\left(\mu^{2}+\sigma^{2}\right)+O\left(\mu^{4}+\sigma^{4}\right) \tag{106}\\
& =F+0.5 \sigma^{2}\left(F^{\prime \prime}+F^{\prime 2}\right)+0.5 \mu^{2} \sigma^{-2}-F-F^{\prime} \mu-0.5 F^{\prime \prime}\left(\mu^{2}+\sigma^{2}\right)+O\left(\mu^{4}+\sigma^{4}\right) \tag{107}\\
& =0.5 \sigma^{2} F^{\prime \prime}+0.5 \sigma^{2} F^{\prime 2}+0.5 \mu^{2} \sigma^{-2}-F^{\prime} \mu-0.5 F^{\prime \prime} \mu^{2}-0.5 F^{\prime \prime} \sigma^{2}+O\left(\mu^{4}+\sigma^{4}\right) \tag{108}\\
& =-F^{\prime} \mu+0.5 \mu^{2} \sigma^{-2}+0.5 F^{\prime 2} \sigma^{2}-0.5 F^{\prime \prime} \mu^{2}+O\left(\mu^{4}+\sigma^{4}\right) \tag{109}
\end{align*}
Up to this point, $\mu$ was arbitrary. We now set
\begin{equation*}
\mu_* := F'(0) \sigma^2 \tag{110}
\end{equation*}
and continue:
\begin{align*}
& KL\left(q \,\|\, p(x_0 = \cdot \mid x_1 = 0)\right) \tag{111}\\
& = -F' \mu_* + 0.5 \mu_*^2 \sigma^{-2} + 0.5 F'^{2} \sigma^2 - 0.5 F'' \mu_*^2 + O(\mu_*^4 + \sigma^4) \tag{112}\\
& = -F'^{2} \sigma^2 + 0.5 F'^{2} \sigma^2 + 0.5 F'^{2} \sigma^2 + O(\sigma^4) \tag{113}\\
& = O(\sigma^4) \tag{114}
\end{align*}
as desired, where the term $0.5 F'' \mu_*^2 = 0.5 F'' F'^{2} \sigma^4$ has been absorbed into $O(\sigma^4)$.
Notice that our choice of $\mu_*$ in the above proof was crucial; for example, if we had set $\mu_* = 0$, the $\Omega(\sigma^2)$ terms in Line (113) would not have cancelled out.
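The cancellation in Lines (112)-(114) can be checked with exact rational arithmetic. The following is a minimal sketch: `F1` and `F2` stand in for $F'(0)$ and $F''(0)$, and their numerical values (like the value of $\sigma^2$) are arbitrary assumptions chosen only for illustration.

```python
from fractions import Fraction

def kl_leading_terms(F1, F2, mu, sigma2):
    """Right-hand side of Line (109) without the O(.) remainder.

    F1, F2 stand for F'(0), F''(0); sigma2 is sigma^2.
    """
    half = Fraction(1, 2)
    return (-F1 * mu
            + half * mu**2 / sigma2
            + half * F1**2 * sigma2
            - half * F2 * mu**2)

# Arbitrary illustrative values (assumptions, not taken from the text).
F1, F2 = Fraction(3, 2), Fraction(-5, 7)
sigma2 = Fraction(1, 100)

mu_star = F1 * sigma2                                  # Equation (110)
at_mu_star = kl_leading_terms(F1, F2, mu_star, sigma2)
at_zero = kl_leading_terms(F1, F2, Fraction(0), sigma2)

# With mu_* only a sigma^4 term survives; with mu = 0 a sigma^2 term remains.
assert at_mu_star == -Fraction(1, 2) * F2 * F1**2 * sigma2**2
assert at_zero == Fraction(1, 2) * F1**2 * sigma2
```

Using `Fraction` rather than floats makes the $\sigma^2$ cancellation exact, so the comparison does not depend on a tolerance.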

B.2 SDE proof sketches

Here is a sketch of the proof of the equivalence of the SDE and the Probability Flow ODE, which relies on the equivalence of the SDE to a Fokker-Planck equation. (See Song et al. [2020] for the full proof.)
Proof.
\begin{align*}
dx & = f(x, t)\, dt + g(t)\, dw \\
\Longleftrightarrow \frac{\partial p_t(x)}{\partial t} & = -\nabla_x (f p_t) + \frac{1}{2} g^2 \nabla_x^2 p_t \tag{FP}\\
& = -\nabla_x (f p_t) + \frac{1}{2} g^2 \nabla_x \left(p_t \nabla_x \log p_t\right) \\
& = -\nabla_x \left\{\left(f - \frac{1}{2} g^2 \nabla_x \log p_t\right) p_t\right\} \\
& = -\nabla_x \left\{\tilde{f}(x, t)\, p_t(x)\right\}, \quad \tilde{f}(x, t) = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x) \\
\Longrightarrow dx & = \tilde{f}(x, t)\, dt
\end{align*}
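As a sanity check of the Probability Flow ODE, consider a hypothetical special case that is not in the text: $f = 0$, constant $g$, and $p_0 = \mathcal{N}(0, s_0^2)$, so that $p_t = \mathcal{N}(0, s_0^2 + g^2 t)$ and the score is $-x / (s_0^2 + g^2 t)$. The ODE $dx = \tilde{f}\, dt$ should then move a point at $k$ standard deviations of $p_0$ to $k$ standard deviations of $p_t$. The sketch below integrates the ODE numerically and compares with that closed form; all parameter values are assumptions.

```python
import math

def probability_flow_drift(x, t, g, s0):
    """f~(x,t) = f - 0.5 * g^2 * d/dx log p_t(x), with f = 0 and p_t = N(0, s0^2 + g^2 t)."""
    score = -x / (s0**2 + g**2 * t)
    return -0.5 * g**2 * score

def integrate_ode(x0, t_end, g, s0, n_steps=10_000):
    """Midpoint (RK2) integration of dx/dt = f~(x,t) from t = 0 to t_end."""
    h = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        k1 = probability_flow_drift(x, t, g, s0)
        k2 = probability_flow_drift(x + 0.5 * h * k1, t + 0.5 * h, g, s0)
        x += h * k2
        t += h
    return x

# Illustrative values (assumptions).
x0, t_end, g, s0 = 0.7, 2.0, 1.5, 0.5
x_ode = integrate_ode(x0, t_end, g, s0)
# Closed form: the point keeps its position measured in standard deviations.
x_exact = x0 * math.sqrt((s0**2 + g**2 * t_end) / s0**2)
assert abs(x_ode - x_exact) < 1e-4
```

The deterministic trajectory rescales every point by the ratio of standard deviations, which is exactly what is needed to carry $\mathcal{N}(0, s_0^2)$ to $\mathcal{N}(0, s_0^2 + g^2 t)$ without injecting noise.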
The equivalence of the SDE and Fokker-Planck equations follows from Itô's formula and integration-by-parts. Here is an outline for a simplified case in 1d, where $g$ is constant (see Winkler [2023] for the full proof):
Proof.
\begin{align*}
dx & = f(x)\, dt + g\, dw, \quad dw \sim \sqrt{dt}\, \mathcal{N}(0, 1) \\
\text{For any } \phi: \quad d\phi(x) & = \left(f(x) \partial_x \phi(x) + \frac{1}{2} g^2 \partial_x^2 \phi(x)\right) dt + g\, \partial_x \phi(x)\, dw \\
\Longrightarrow \frac{d}{dt} \mathbb{E}[\phi] & = \mathbb{E}\left[f \partial_x \phi + \frac{1}{2} g^2 \partial_x^2 \phi\right], \quad (\mathbb{E}[dw] = 0) \\
\int \phi(x)\, \partial_t p(x, t)\, dx & = \int f(x)\, \partial_x \phi(x)\, p(x, t)\, dx + \frac{1}{2} g^2 \int \partial_x^2 \phi(x)\, p(x, t)\, dx \\
& = -\int \phi(x)\, \partial_x \left(f(x) p(x, t)\right) dx + \frac{1}{2} g^2 \int \phi(x)\, \partial_x^2 p(x, t)\, dx && \text{(integration-by-parts)} \\
\partial_t p(x, t) & = -\partial_x \left(f(x) p(x, t)\right) + \frac{1}{2} g^2 \partial_x^2 p(x, t) && \text{(Fokker-Planck)}
\end{align*}
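The 1d Fokker-Planck equation can be sanity-checked on a case with a known stationary solution. The sketch below assumes an Ornstein-Uhlenbeck drift $f(x) = -\theta x$ (an illustrative choice, not from the text), whose stationary density is $\mathcal{N}(0, g^2 / (2\theta))$. For a stationary density $\partial_t p = 0$, so the right-hand side of the Fokker-Planck equation should vanish; it is evaluated here with central finite differences.

```python
import math

def stationary_density(x, theta, g):
    """Density of N(0, g^2 / (2 theta)), the stationary law of dx = -theta x dt + g dw."""
    var = g**2 / (2.0 * theta)
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def fokker_planck_rhs(x, theta, g, h=1e-3):
    """-d/dx (f p) + 0.5 g^2 d^2/dx^2 p with f(x) = -theta x, by central differences."""
    p = lambda y: stationary_density(y, theta, g)
    flux = lambda y: (-theta * y) * p(y)
    d_flux = (flux(x + h) - flux(x - h)) / (2.0 * h)
    d2_p = (p(x + h) - 2.0 * p(x) + p(x - h)) / (h * h)
    return -d_flux + 0.5 * g**2 * d2_p

# Illustrative parameters and evaluation points (assumptions).
residuals = [fokker_planck_rhs(x, theta=1.3, g=0.8) for x in (-1.0, -0.2, 0.0, 0.5, 1.5)]
assert all(abs(r) < 1e-4 for r in residuals)
```

The residuals are not exactly zero only because of the finite-difference truncation error, which scales like $h^2$.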

B.3 DDIM Point-mass Claim

Here is a version of Claim 3 where $p_0$ is a delta at an arbitrary point $x_0$.
Claim 5. Suppose the target distribution is a point mass at $x_0 \in \mathbb{R}^d$, i.e. $p_0 = \delta_{x_0}$. Define the function
\begin{equation*}
G_t[x_0](x_t) = \left(\frac{\sigma_{t-\Delta t}}{\sigma_t}\right)(x_t - x_0) + x_0 \tag{115}
\end{equation*}
Then we clearly have $G_t[x_0] \sharp p_t = p_{t-\Delta t}$, and moreover
\begin{equation*}
G_t[x_0](x_t) = x_t + \lambda \left(\mathbb{E}[x_{t-\Delta t} \mid x_t] - x_t\right) =: F_t(x_t) \tag{116}
\end{equation*}
Thus Algorithm 2 defines a valid reverse sampler for target distribution $p_0 = \delta_{x_0}$.
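Equation (116) can be verified numerically in the simplified setup $\sigma_t = \sigma_q \sqrt{t}$. The sketch below makes two assumptions that are not restated in this appendix: that $\lambda = \sigma_t / (\sigma_{t-\Delta t} + \sigma_t)$, and that for a point mass the conditional mean is $\mathbb{E}[x_{t-\Delta t} \mid x_t] = x_t + (\Delta t / t)(x_0 - x_t)$, which is what Claim 2 gives when $\mathbb{E}[x_0 \mid x_t] = x_0$. Function names are hypothetical.

```python
import math

def sigma(t, sigma_q):
    return sigma_q * math.sqrt(t)

def G(x_t, x0, t, dt, sigma_q=1.0):
    """Scaling map of Equation (115)."""
    return (sigma(t - dt, sigma_q) / sigma(t, sigma_q)) * (x_t - x0) + x0

def F(x_t, x0, t, dt, sigma_q=1.0):
    """Right-hand side of Equation (116) for a point mass at x0."""
    # Assumption: lambda = sigma_t / (sigma_{t-dt} + sigma_t).
    lam = sigma(t, sigma_q) / (sigma(t - dt, sigma_q) + sigma(t, sigma_q))
    # Point mass: E[x_{t-dt} | x_t] = x_t + (dt / t) * (x0 - x_t).
    cond_mean = x_t + (dt / t) * (x0 - x_t)
    return x_t + lam * (cond_mean - x_t)

# Illustrative test points (assumptions).
for x_t, x0, t, dt in [(2.0, 0.5, 0.8, 0.1), (-1.0, 3.0, 0.3, 0.05), (0.0, 0.0, 1.0, 0.25)]:
    assert abs(G(x_t, x0, t, dt, sigma_q=2.0) - F(x_t, x0, t, dt, sigma_q=2.0)) < 1e-12
```

The agreement rests on the identity $\Delta t / t = (\sigma_t - \sigma_{t-\Delta t})(\sigma_t + \sigma_{t-\Delta t}) / \sigma_t^2$, which holds because $\sigma_t^2$ is linear in $t$.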

B.4 Flow Combining Lemma

Here we provide a more formal statement of the marginal flow result stated in Equation (64).
Equation (64) follows from a more general lemma (Lemma 3) which formalizes the "gas combination" analogy of Section 3. The motivation for this lemma is that we need a way of combining flows: of taking several different flows and producing a single "effective flow."
As a warm-up for the lemma, suppose we have $n$ different flows, each with their own initial and final distributions $q_i, p_i$:
\begin{equation*}
q_1 \stackrel{v^{(1)}}{\hookrightarrow} p_1, \quad q_2 \stackrel{v^{(2)}}{\hookrightarrow} p_2, \quad \ldots, \quad q_n \stackrel{v^{(n)}}{\hookrightarrow} p_n
\end{equation*}
We can imagine these as the flow of $n$ different gases, where gas $i$ has initial density $q_i$ and final density $p_i$. Now we want to construct an overall flow $v^*$ which takes the average initial-density to the average final-density:
\begin{equation*}
\underset{i \in [n]}{\mathbb{E}}[q_i] \stackrel{v^*}{\hookrightarrow} \underset{i \in [n]}{\mathbb{E}}[p_i] \tag{117}
\end{equation*}
To construct $v_t^*(x_t)$, we must take an average of the individual vector fields $v^{(i)}$, weighted by the probability mass the $i$-th flow places on $x_t$, at time $t$. (This is exactly analogous to Figure 5).
This construction is formalized in Lemma 3. There, instead of averaging over just a finite set of flows, we are allowed to average over any distribution over flows. To recover Equation (64), we can apply Lemma 3 to a distribution $\Gamma$ over $(v, q_v) = (v^{[x_1, x_0]}, \delta_{x_1})$, that is, pointwise flows and their associated initial delta distributions.
Lemma 3 (Flow Combining Lemma). Let $\Gamma$ be an arbitrary joint distribution over pairs $(v, q_v)$ of flows $v$ and their associated initial distributions $q_v$. Let $v(q_v)$ denote the final distribution when initial distribution $q_v$ is transported by flow $v$, so $q_v \stackrel{v}{\hookrightarrow} v(q_v)$.
For fixed $t \in [0, 1]$, consider the joint distribution over $(x_1, x_t, w_t) \in (\mathbb{R}^d)^3$ generated by:
\begin{align*}
(v, q_v) & \sim \Gamma \\
x_1 & \sim q_v \\
x_t & := \operatorname{RunFlow}(v, x_1, t) \\
w_t & := v_t(x_t).
\end{align*}
Then, taking all expectations w.r.t. this joint distribution, the flow $v^*$ defined as
\begin{align*}
v_t^*(x_t) & := \mathbb{E}[w_t \mid x_t] \tag{118}\\
& = \mathbb{E}[v_t(x_t) \mid x_t] \tag{119}
\end{align*}
is known as the marginal flow for $\Gamma$, and transports:
\begin{equation*}
\mathbb{E}[q_v] \stackrel{v^*}{\hookrightarrow} \mathbb{E}[v(q_v)] \tag{120}
\end{equation*}
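For the finite warm-up case, the conditional expectation in (118) reduces to a density-weighted average of the individual velocities. The sketch below illustrates this in one dimension under assumptions that are not in the text: the $n$ flows are mixed uniformly, each flow's density at time $t$ is available in closed form (here two Gaussians), and each velocity field is a constant. The helper names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def marginal_velocity(x, densities, velocities):
    """v*(x) = E[w | x] for a uniform mixture of flows.

    densities[i](x) is the density the i-th flow places on x at the current time,
    velocities[i](x) is the i-th vector field evaluated at x.
    """
    weights = [p(x) for p in densities]
    total = sum(weights)
    return sum(w * v(x) for w, v in zip(weights, velocities)) / total

# Two "gases": one centred at -1 moving right, one centred at +1 moving left.
densities = [lambda x: gaussian_pdf(x, -1.0, 0.5), lambda x: gaussian_pdf(x, 1.0, 0.5)]
velocities = [lambda x: 1.0, lambda x: -1.0]

v_mid = marginal_velocity(0.0, densities, velocities)    # equal mass from both gases
v_left = marginal_velocity(-3.0, densities, velocities)  # almost all mass from gas 0
assert abs(v_mid) < 1e-12
assert v_left > 0.999
```

At a point where only one gas has appreciable density the marginal flow follows that gas, while at a point shared equally by both the opposing velocities average out.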

B.5 Derivation of DDIM using Flows

Now that we have the machinery of flows in hand, it is fairly easy to derive the DDIM algorithm "from scratch", by extending our simple scaling algorithm from the single point-mass case.
First, we need to find the pointwise flow. Recall from Claim 5 that for the simple case where the target distribution $p_0$ is a Dirac-delta at $x_0$, the following scaling maps $p_t$ to $p_{t-\Delta t}$:
\begin{equation*}
G_t[x_0](x_t) = \left(\frac{\sigma_{t-\Delta t}}{\sigma_t}\right)(x_t - x_0) + x_0 \Longrightarrow G_t \sharp p_t = p_{t-\Delta t}
\end{equation*}
$G_t$ implies the pointwise flow:
\begin{align*}
& \frac{\sigma_{t-\Delta t}}{\sigma_t} = \sqrt{1 - \frac{\Delta t}{t}} = 1 - \frac{\Delta t}{2t} + O(\Delta t^2) \\
& \Longrightarrow v_t^{[x_1, x_0]}(x_t) = -\lim_{\Delta t \rightarrow 0} \frac{G_t(x_t) - x_t}{\Delta t} = \frac{1}{2t}(x_t - x_0)
\end{align*}
which agrees with (70).
Now let us compute the marginal flow $v^*$ generated by the pointwise flow of Equation (70) and the coupling implied by the diffusion forward process. By Equation (69), the marginal flow is:
\begin{align*}
v_t^*(x_t) & = \underset{x_1, x_0 \mid x_t}{\mathbb{E}}\left[v_t^{[x_1, x_0]}(x_t) \mid x_t\right] \\
& = \frac{1}{2t} \underset{\substack{x_0 \sim p;\ x_1 \leftarrow x_0 + \mathcal{N}(0, \sigma_q^2) \\ x_t \leftarrow \operatorname{RunFlow}(v_t^{[x_1, x_0]}, x_1, t)}}{\mathbb{E}}\left[x_0 - x_t \mid x_t\right] \\
& = \frac{1}{2t} \underset{\substack{x_0 \sim p;\ x_1 \leftarrow x_0 + \mathcal{N}(0, \sigma_q^2) \\ x_t \leftarrow x_1 \sqrt{t} + (1 - \sqrt{t}) x_0}}{\mathbb{E}}\left[x_0 - x_t \mid x_t\right] \\
& = \frac{1}{2t} \underset{\substack{x_0 \sim p \\ x_t \leftarrow x_0 + \sqrt{t}\, \mathcal{N}(0, \sigma_q^2)}}{\mathbb{E}}\left[x_0 - x_t \mid x_t\right]
\end{align*}
This is exactly the differential equation describing the trajectory of DDIM (see Equation 58, which is the continuous-time limit of Equation 33).

B.6 Two Pointwise Flows for DDIM give the same Trajectory

We want to show that the pointwise flow (71):
\begin{equation*}
v_t^{[x_1, x_0]}(x_t) = \frac{1}{2\sqrt{t}}(x_0 - x_1) \tag{121}
\end{equation*}
is equivalent to the DDIM pointwise flow (70):
\begin{equation*}
v_t^{[x_1, x_0]}(x_t) = \frac{1}{2t}(x_t - x_0) \tag{122}
\end{equation*}
because both these pointwise flows generate the same trajectory of $x_t$:
\begin{equation*}
x_t = x_0 + (x_1 - x_0)\sqrt{t} \tag{123}
\end{equation*}
To see this, we can solve the ODE determined by (70) via the Separable Equations method:
\begin{align*}
\frac{dx_t}{dt} & = -\frac{1}{2t}(x_0 - x_t) \\
\Longrightarrow \frac{\frac{dx_t}{dt}}{x_t - x_0} & = \frac{1}{2t} \\
\Longrightarrow \int \frac{1}{x_t - x_0}\, dx & = \int \frac{1}{2t}\, dt, \quad \text{since } \frac{dx_t}{dt}\, dt = dx \\
\Longrightarrow \log(x_t - x_0) & = \log \sqrt{t} + c \\
c & = \log(x_1 - x_0) \quad \text{(boundary cond.)} \\
\Longrightarrow \log(x_t - x_0) & = \log\left(\sqrt{t}\,(x_1 - x_0)\right) \\
\Longrightarrow x_t - x_0 & = \sqrt{t}\,(x_1 - x_0)
\end{align*}
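The same conclusion can be checked numerically by integrating both ODEs backward from $t = 1$ and comparing with the closed-form trajectory (123). In the sketch below, the first right-hand side is the first line of the derivation above, $dx_t / dt = (x_t - x_0) / (2t)$. The second is $dx_t / dt = (x_1 - x_0) / (2\sqrt{t})$, which is what (121) gives under the assumed convention $dx_t / dt = -v_t(x_t)$. The endpoint values and the integrator are illustrative assumptions.

```python
import math

def integrate_backward(rhs, x1, t_end, n_steps=20_000):
    """Midpoint (RK2) integration of dx/dt = rhs(x, t) from t = 1 down to t_end."""
    h = (t_end - 1.0) / n_steps          # negative step: time runs from 1 to t_end
    x, t = x1, 1.0
    for _ in range(n_steps):
        k1 = rhs(x, t)
        k2 = rhs(x + 0.5 * h * k1, t + 0.5 * h)
        x += h * k2
        t += h
    return x

# Illustrative endpoints (assumptions).
x0, x1, t_end = 2.0, -1.0, 0.25

ode_ddim = lambda x, t: (x - x0) / (2.0 * t)               # first line of the derivation
ode_sqrt = lambda x, t: (x1 - x0) / (2.0 * math.sqrt(t))   # from (121), with dx/dt = -v

closed_form = x0 + (x1 - x0) * math.sqrt(t_end)            # Equation (123)
assert abs(integrate_backward(ode_ddim, x1, t_end) - closed_form) < 1e-6
assert abs(integrate_backward(ode_sqrt, x1, t_end) - closed_form) < 1e-6
```

The integration stops at $t = 0.25$ rather than $t = 0$ because the factor $1 / (2t)$ is singular at the origin.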

B.7 DDIM vs Time-reparameterized linear flows

Lemma 4 (DDIM vs Linear Flows). Let $p_0$ be an arbitrary target distribution. Let $\{x_t\}_t$ be the joint distribution defined by the DDPM forward process applied to $p_0$, so the marginal distribution of $x_t$ is $p_t = p \star \mathcal{N}(0, t\sigma_q^2)$.
Let $x^* \in \mathbb{R}^d$ be an arbitrary initial point. Consider the following two deterministic trajectories:
  1. The trajectory $\{y_t\}_t$ of the continuous-time DDIM flow, with respect to target distribution $p_0$, when started at initial point $y_1 = x^*$.
That is, $y_t$ is the solution to the following ODE (Equation 58):
\begin{align*}
\frac{dy_t}{dt} & = -v^{\mathrm{ddim}}(y_t) \tag{124}\\
& = -\frac{1}{2t} \underset{x_0 \mid x_t}{\mathbb{E}}\left[x_0 - x_t \mid x_t = y_t\right] \tag{125}
\end{align*}
with boundary condition $y_1$ at $t = 1$.
  2. The trajectory $\{z_t\}_t$ produced when initial point $z_1 = x^*$ is transported by the marginal flow constructed from:
  • Linear pointwise flows
  • The DDPM-coupling of Line (73).
That is, the marginal flow
\begin{align*}
v_t^\star(x_t) & = \underset{x_0, x_1 \mid x_t}{\mathbb{E}}\left[v^{[x_1, x_0]}(x_t) \mid x_t\right] \\
& := \underset{x_0, x_1 \mid x_t}{\mathbb{E}}\left[x_0 - x_1 \mid x_t\right] \\
& = \underset{x_0 \mid x_t}{\mathbb{E}}\left[x_0 - x_t \mid x_t\right]
\end{align*}
since $\mathbb{E}[x_1 \mid x_t] = x_t$ under the DDPM coupling.
Then, we claim these two trajectories are identical with the following time-reparameterization:
\begin{equation*}
\forall t \in [0, 1]: \quad y_t = z_{\sqrt{t}} \tag{126}
\end{equation*}

B.8 Proof Sketch of Claim 2

We will show that, in the forward diffusion setup of Section 1:
\begin{equation*}
\mathbb{E}\left[(x_t - x_{t-\Delta t}) \mid x_t\right] = \frac{\Delta t}{t} \mathbb{E}\left[(x_t - x_0) \mid x_t\right] \tag{127}
\end{equation*}
Proof sketch. Recall $\eta_t = x_{t+\Delta t} - x_t$. So by linearity of expectation:
\begin{align*}
\mathbb{E}\left[(x_t - x_0) \mid x_t\right] & = \mathbb{E}\left[\sum_{i<t} \eta_i \mid x_t\right] \tag{128}\\
& = \sum_{i<t} \mathbb{E}\left[\eta_i \mid x_t\right] \tag{129}
\end{align*}
Now, we claim that for given $x_t$, the conditional distributions $p(\eta_i \mid x_t)$ are identical for all $i < t$. To see this, notice that the joint distribution function $p(x_0, x_t, \eta_0, \eta_{\Delta t}, \ldots, \eta_{t-\Delta t})$ is symmetric in the $\{\eta_i\}$s, by definition of the forward process, and therefore the conditional distribution function $p(\eta_0, \eta_{\Delta t}, \ldots, \eta_{t-\Delta t} \mid x_t)$ is also symmetric in the $\{\eta_i\}$s. Therefore, all $\eta_i$ have identical conditional expectations:
\begin{equation*}
\mathbb{E}[\eta_0 \mid x_t] = \mathbb{E}[\eta_{\Delta t} \mid x_t] = \cdots = \mathbb{E}[\eta_{t-\Delta t} \mid x_t] \tag{130}
\end{equation*}
And since there are $(t / \Delta t)$ of them,
\begin{equation*}
\sum_{i<t} \mathbb{E}[\eta_i \mid x_t] = \frac{t}{\Delta t} \mathbb{E}[\eta_{t-\Delta t} \mid x_t] \tag{131}
\end{equation*}
Now continuing from Line (129),
\begin{align*}
\mathbb{E}\left[(x_t - x_0) \mid x_t\right] & = \sum_{i<t} \mathbb{E}[\eta_i \mid x_t] \tag{132}\\
& = (t / \Delta t)\, \mathbb{E}[\eta_{t-\Delta t} \mid x_t] \tag{133}\\
& = (t / \Delta t)\, \mathbb{E}\left[(x_t - x_{t-\Delta t}) \mid x_t\right] \tag{134}
\end{align*}
as desired.
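Equation (127) can be confirmed exactly in a case where both conditional means have closed forms. The sketch below assumes a Gaussian target $x_0 \sim \mathcal{N}(0, s^2)$ (an illustrative choice, not from the text), so that $x_t = x_0 + \mathcal{N}(0, \sigma_q^2 t)$ and standard Gaussian conditioning gives $\mathbb{E}[a \mid a + b = x]$ for independent zero-mean Gaussians $a, b$. All numerical values are arbitrary.

```python
from fractions import Fraction

def posterior_mean(x, var_signal, var_noise):
    """E[a | a + b = x] for independent a ~ N(0, var_signal), b ~ N(0, var_noise)."""
    return x * var_signal / (var_signal + var_noise)

# Illustrative values (assumptions).
s2, sigma_q2 = Fraction(3), Fraction(2)        # Var[x_0] and sigma_q^2
t, dt, x_t = Fraction(3, 5), Fraction(1, 10), Fraction(7, 4)

# x_t = x_0 + noise of variance sigma_q^2 * t
E_x0 = posterior_mean(x_t, s2, sigma_q2 * t)
# x_t = x_{t-dt} + noise of variance sigma_q^2 * dt, with Var[x_{t-dt}] = s2 + sigma_q^2 (t - dt)
E_prev = posterior_mean(x_t, s2 + sigma_q2 * (t - dt), sigma_q2 * dt)

lhs = x_t - E_prev                 # E[x_t - x_{t-dt} | x_t]
rhs = (dt / t) * (x_t - E_x0)      # (dt / t) * E[x_t - x_0 | x_t]
assert lhs == rhs                  # Equation (127), exact in rational arithmetic
```

The Gaussian case is only one instance; the symmetry argument in the proof sketch does not depend on the form of the target distribution.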

B.9 Variance-Reduced Algorithms

Here we give the "variance-reduced" versions of the DDPM training and sampling algorithms, where we train a network $g_\theta$ to approximate
\begin{equation*}
g_\theta(x, t) \approx \mathbb{E}[x_0 \mid x_t] \tag{135}
\end{equation*}
instead of a network $f_\theta$ to approximate
\begin{equation*}
f_\theta(x, t) \approx \mathbb{E}[x_{t-\Delta t} \mid x_t] \tag{136}
\end{equation*}
Via Claim 2, these two functions are equivalent via the transform:
\begin{equation*}
f_\theta(x, t) = (\Delta t / t)\, g_\theta(x, t) + (1 - \Delta t / t)\, x. \tag{137}
\end{equation*}
Plugging this relation into Pseudocode 2 yields the variance-reduced DDPM sampler of Pseudocode 7.
Pseudocode 6: DDPM train loss ($x_{0}$-prediction)
    Input: Neural network $g_{\theta}$; Sample-access to
        train distribution $p$.
    Data: Terminal variance $\sigma_{q}$
    Output: Stochastic loss $L$
    $x_{0} \leftarrow \operatorname{Sample}(p)$
    $t \leftarrow \operatorname{Unif}[0,1]$
    $x_{t} \leftarrow x_{0}+\mathcal{N}\left(0, \sigma_{q}^{2} t\right)$
    $L \leftarrow\left\|g_{\theta}\left(x_{t}, t\right)-x_{0}\right\|_{2}^{2}$
    return $L$
Pseudocode 7: DDPM sampling ($x_{0}$-prediction)
    Input: Trained model $g_{\theta}$.
    Data: Terminal variance $\sigma_{q}$; step-size $\Delta t$.
    Output: $x_{0}$
    $x_{1} \leftarrow \mathcal{N}\left(0, \sigma_{q}^{2}\right)$
    for $t=1,(1-\Delta t),(1-2 \Delta t), \ldots, \Delta t$ do
        $\widehat{\eta}_{t} \leftarrow g_{\theta}\left(x_{t}, t\right)-x_{t}$
        $x_{t-\Delta t} \leftarrow x_{t}+(1 / t) \widehat{\eta}_{t} \Delta t+\mathcal{N}\left(0, \sigma_{q}^{2} \Delta t\right)$
    end
    return $x_{0}$
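The two pseudocode listings translate almost line by line into Python. The following is a minimal scalar sketch, not a reference implementation: `g_theta` is any callable `(x_t, t) -> prediction of x_0`, `sample_p` is a hypothetical callable that draws from the training distribution, and the step size is passed as an integer `n_steps` with $\Delta t = 1 / \texttt{n\_steps}$ to avoid floating-point drift in the loop over $t$. For illustration the "trained model" is replaced by the exact conditional mean for a point-mass target at `c`, where $\mathbb{E}[x_0 \mid x_t] = c$.

```python
import math
import random

def train_loss(g_theta, sample_p, sigma_q, rng):
    """One stochastic loss value, following Pseudocode 6 (scalar data)."""
    x0 = sample_p(rng)
    t = rng.uniform(0.0, 1.0)
    x_t = x0 + rng.gauss(0.0, sigma_q * math.sqrt(t))   # N(0, sigma_q^2 t) has std sigma_q sqrt(t)
    return (g_theta(x_t, t) - x0) ** 2

def sample(g_theta, sigma_q, n_steps, rng):
    """Variance-reduced DDPM sampler, following Pseudocode 7 (scalar data)."""
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, sigma_q)                         # x_1 ~ N(0, sigma_q^2)
    for k in range(n_steps, 0, -1):                     # t = 1, 1 - dt, ..., dt
        t = k * dt
        eta_hat = g_theta(x, t) - x
        x = x + (1.0 / t) * eta_hat * dt + rng.gauss(0.0, sigma_q * math.sqrt(dt))
    return x

# Stand-in for a trained network (assumption): exact E[x_0 | x_t] for a point mass at c.
c = 3.0
g_exact = lambda x, t: c

rng = random.Random(0)
x0_sample = sample(g_exact, sigma_q=1.0, n_steps=1000, rng=rng)
loss = train_loss(g_exact, lambda r: c, 1.0, rng)
```

With this stand-in model the last step lands on `c` plus noise of standard deviation $\sigma_q \sqrt{\Delta t}$, so the sample concentrates on the point mass as $\Delta t \to 0$, and the training loss is zero because the prediction is exact.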

B. 10 Equivalence of and x 0 x 0 x_(0)x_{0}x0 - and ε ε epsi\varepsilonε-prediction

We will discuss this in our usual simplified setup:
x t = x 0 + σ t ε t , σ t = σ q t , ε t N ( 0 , 1 ) x t = x 0 + σ t ε t , σ t = σ q t , ε t N ( 0 , 1 ) x_(t)=x_(0)+sigma_(t)epsi_(t),quadsigma_(t)=sigma_(q)sqrtt,quadepsi_(t)∼N(0,1)x_{t}=x_{0}+\sigma_{t} \varepsilon_{t}, \quad \sigma_{t}=\sigma_{q} \sqrt{t}, \quad \varepsilon_{t} \sim \mathcal{N}(0,1)xt=x0+σtεt,σt=σqt,εtN(0,1)
the scaling factors are more complex in the general case (see Luo [2022] for VP diffusion, for example) but the idea is the same. The DDPM training algorithm 1 has objective and optimal value
min θ f θ ( x t , t ) x t Δ t 2 2 , f θ ( x t , t ) = E [ x t Δ t x t ] min θ f θ x t , t x t Δ t 2 2 , f θ x t , t = E x t Δ t x t min_(theta)||f_(theta)(x_(t),t)-x_(t-Delta t)||_(2)^(2),quadf_(theta)^(***)(x_(t),t)=E[x_(t-Delta t)∣x_(t)]\min _{\theta}\left\|f_{\theta}\left(x_{t}, t\right)-x_{t-\Delta t}\right\|_{2}^{2}, \quad f_{\theta}^{\star}\left(x_{t}, t\right)=\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]minθfθ(xt,t)xtΔt22,fθ(xt,t)=E[xtΔtxt]
That is, the network f θ f θ f_(theta)f_{\theta}fθ to learn to predict E [ x t Δ t x t ] E x t Δ t x t E[x_(t-Delta t)∣x_(t)]\mathbb{E}\left[x_{t-\Delta t} \mid x_{t}\right]E[xtΔtxt]. However, we could instead require the network to predict other related quantities, as follows. Noting that
E [ x t Δ t x t x t ] = eq. 23 Δ t t E [ x 0 x t x t ] Δ t t σ t E [ ε t x t ] E [ x t Δ t x t x t ] x t Δ t 2 2 = Δ t t ( E [ x 0 x t ] x 0 ) 2 2 = Δ t t σ t ( E [ ε t x t ] ε t ) 2 2 E x t Δ t x t x t =  eq.  23 Δ t t E x 0 x t x t Δ t t σ t E ε t x t E x t Δ t x t x t x t Δ t 2 2 = Δ t t E x 0 x t x 0 2 2 = Δ t t σ t E ε t x t ε t 2 2 {:[E[x_(t-Delta t)-x_(t)∣x_(t)]=^(" eq. "23)(Delta t)/(t)E[x_(0)-x_(t)∣x_(t)]-=(Delta t)/(tsigma_(t))E[epsi_(t)∣x_(t)]],[Longrightarrow||E[x_(t-Delta t)-x_(t)∣x_(t)]-x_(t-Delta t)||_(2)^(2)=||(Delta t)/(t)(E[x_(0)∣x_(t)]-x_(0))||_(2)^(2)=||(Delta t)/(tsigma_(t))(E[epsi_(t)∣x_(t)]-epsi_(t))||_(2)^(2)]:}\begin{aligned} & \mathbb{E}\left[x_{t-\Delta t}-x_{t} \mid x_{t}\right] \stackrel{\text { eq. } 23}{=} \frac{\Delta t}{t} \mathbb{E}\left[x_{0}-x_{t} \mid x_{t}\right] \equiv \frac{\Delta t}{t \sigma_{t}} \mathbb{E}\left[\varepsilon_{t} \mid x_{t}\right] \\ \Longrightarrow & \left\|E\left[x_{t-\Delta t}-x_{t} \mid x_{t}\right]-x_{t-\Delta t}\right\|_{2}^{2}=\left\|\frac{\Delta t}{t}\left(\mathbb{E}\left[x_{0} \mid x_{t}\right]-x_{0}\right)\right\|_{2}^{2}=\left\|\frac{\Delta t}{t \sigma_{t}}\left(\mathbb{E}\left[\varepsilon_{t} \mid x_{t}\right]-\varepsilon_{t}\right)\right\|_{2}^{2} \end{aligned}E[xtΔtxtxt]= eq. 23ΔttE[x0xtxt]ΔttσtE[εtxt]E[xtΔtxtxt]xtΔt22=Δtt(E[x0xt]x0)22=Δttσt(E[εtxt]εt)22
we get the following equivalent problems:
$$
\begin{aligned}
& \min_{\theta}\left\|f_{\theta}\left(x_{t}, t\right)-x_{0}\right\|_{2}^{2} \Longrightarrow f_{\theta}^{\star}\left(x_{t}, t\right)=\mathbb{E}\left[x_{0} \mid x_{t}\right], \quad \text{time-weighting} = \frac{1}{t} \\
& \min_{\theta}\left\|\frac{\Delta t}{t \sigma_{t}}\left(f_{\theta}\left(x_{t}, t\right)-\varepsilon_{t}\right)\right\|_{2}^{2} \Longrightarrow f_{\theta}^{\star}\left(x_{t}, t\right)=\mathbb{E}\left[\varepsilon_{t} \mid x_{t}\right], \quad \text{time-weighting} = \frac{1}{t \sigma_{t}}.
\end{aligned}
$$
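As a rough numerical illustration of the first equivalence, the sketch below checks that a predictor of $\mathbb{E}[x_0 \mid x_t]$ can be converted into a predictor of $\mathbb{E}[x_{t-\Delta t} \mid x_t]$ using the relation labeled eq. 23 above. It is only a sanity check under assumptions that are not taken from the text: a one-dimensional Gaussian data distribution $x_0 \sim \mathcal{N}(0, s^2)$ (so that both conditional means have closed forms), a forward process with variance $\sigma_q^2 t$ at time $t$, and hypothetical function names and parameter values chosen for illustration.

```python
import numpy as np

def posterior_mean_x0(x_t, t, data_std, sigma_q):
    """Closed-form E[x_0 | x_t], assuming x_0 ~ N(0, data_std^2)
    and x_t = x_0 + N(0, sigma_q^2 * t). (Toy assumption, not from the text.)"""
    var_0 = data_std ** 2
    var_t = var_0 + sigma_q ** 2 * t
    return (var_0 / var_t) * x_t

def step_mean_from_x0_pred(x_t, x0_pred, t, dt):
    """Convert an x_0-prediction into E[x_{t-dt} | x_t] via
    E[x_{t-dt} - x_t | x_t] = (dt / t) * E[x_0 - x_t | x_t]."""
    return x_t + (dt / t) * (x0_pred - x_t)

def step_mean_closed_form(x_t, t, dt, data_std, sigma_q):
    """Closed-form E[x_{t-dt} | x_t] for the same toy Gaussian setup:
    x_{t-dt} and x_t are jointly Gaussian with Cov = Var(x_{t-dt})."""
    var_0 = data_std ** 2
    var_prev = var_0 + sigma_q ** 2 * (t - dt)
    var_t = var_0 + sigma_q ** 2 * t
    return (var_prev / var_t) * x_t

# Hypothetical values, chosen only for the check.
x_t = np.linspace(-3.0, 3.0, 7)
t, dt, data_std, sigma_q = 0.5, 0.01, 1.0, 2.0

x0_pred = posterior_mean_x0(x_t, t, data_std, sigma_q)
via_x0 = step_mean_from_x0_pred(x_t, x0_pred, t, dt)
direct = step_mean_closed_form(x_t, t, dt, data_std, sigma_q)

# Largest disagreement between the two routes to E[x_{t-dt} | x_t].
max_gap = float(np.max(np.abs(via_x0 - direct)))
print(max_gap)
```

In this toy setting the two routes agree up to floating-point error, which is consistent with the claim that regressing onto $x_0$ and regressing onto $x_{t-\Delta t}$ target the same information, differing only in how the loss is weighted across time.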

References

Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428-446, 2020. ↑34
Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2023. ↑25, ↑31, ↑35, ↑37
Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2022. ↑31
Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982. ↑14
Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253-5270, 2023. ↑23
Stanley H. Chan. Tutorial on diffusion models for imaging and vision, 2024. ↑36
Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735-4763. PMLR, 2023. ↑23
Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2022. ↑23
Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. Advances in Neural Information Processing Systems, 36, 2024a. ↑23
Sitan Chen, Vasilis Kontonis, and Kulin Shah. Learning general gaussian mixtures with efficient score matching. arXiv preprint arXiv:2404.18893, 2024b. ↑23
Ayan Das. Building diffusion model's theory from ground up. In ICLR Blogposts 2024, 2024. URL https://iclr-blogposts.github.io/2024/blog/diffusion-theory-from-scratch/. ↑36
Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022. ↑23
Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695-17709, 2021. ↑23
Sander Dieleman. Perspectives on diffusion, 2023. URL https://sander.ai/2023/07/20/perspectives.html. ↑36
Tony Duan. Diffusion models from scratch, 2023. URL https://www.tonyduan.com/diffusion/index.html. ↑36
Ronen Eldan. Lecture notes - from stochastic calculus to geometric inequalities, 2024. URL https://www.wisdom.weizmann.ac.il/~ronene/GFANotes.pdf. ↑13
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024. ↑25
Lawrence C Evans. An introduction to stochastic differential equations, volume 82. American Mathematical Soc., 2012. ↑13
Tor Fjelde, Emile Mathieu, and Vincent Dutordoir. An introduction to flow matching, January 2024. URL https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html. ↑31, ↑37
Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, and Ye Wang. On memorization in diffusion models. arXiv preprint arXiv:2310.02664, 2023. ↑23
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840-6851, 2020. ↑3, ↑8, ↑32, ↑33
Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ANvmVS2Yr0. ↑33
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. ↑12, ↑14, ↑32, ↑36, ↑37
Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NnMEadcdyD. ↑33
P.E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Stochastic Modelling and Applied Probability. Springer Berlin Heidelberg, 2011. ISBN 9783540540625. URL https://books.google.com/books?id=BCvtssom1CMC. ↑13
Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946-985. PMLR, 2023. ↑23
Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024. ↑32
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. ↑25, ↑28
Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with
Xingchao Liu, Lemeng Wu, Mao Ye, and Qiang Liu. Let us build bridges: Understanding and extending diffusion generative models, 2022b. ↑25
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775-5787, 2022a. ↑32
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b. ↑32
Calvin Luo. Understanding diffusion models: A unified perspective, 2022. ↑33, ↑36, ↑46
David McAllester. On the mathematics of diffusion models, 2023. ↑36
Andrea Montanari. Sampling, diffusions, and stochastic localization, 2023. ↑37
Stefano Peluchetti. Non-denoising forward-time diffusions, 2022. URL https://openreview.net/forum?id=oVfIKuhqfC. ↑25
Frank Permenter and Chenyang Yuan. Interpreting and improving diffusion models using the euclidean distance function. arXiv preprint arXiv:2306.04848, 2023. ↑36
Gabriel Peyré. Denoising diffusion models, 2023. URL https://mathematical-tours.github.io/book-sources/optim-ml/OptimML-DiffusionModels.pdf. ↑37
Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky TQ Chen. Multisample flow matching: Straightening flows with minibatch couplings. In International Conference on Machine Learning, pages 28100-28127. PMLR, 2023. ↑31
Fabio De Sousa Ribeiro and Ben Glocker. Demystifying variational diffusion models, 2024. ↑36
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. ↑34
Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. arXiv preprint arXiv:2403.12015, 2024. ↑32
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015. URL http://arxiv.org/abs/1503.03585. ↑4, ↑8, ↑10, ↑33, ↑36
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6048-6058, 2023. ↑23
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP. ↑3, ↑16, ↑32
Jiaming Song, Chenlin Meng, and Arash Vahdat. Cvpr 2023 tutorial: Denoising diffusion models: A generative learning big bang, 2023a. URL https://cvpr2023-tutorial-diffusion-models.github.io. ↑36
Yang Song. Generative modeling by estimating gradients of the data distribution, 2021. URL https://yang-song.net/blog/2021/score/. ↑13
Yang Song. Diffusion and score-based generative models, 2023. URL https://www.youtube.com/watch?v=wMmqCMwuM2Q. ↑36
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. ↑32
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. URL https://arxiv.org/pdf/2011.13456.pdf. ↑14, ↑21, ↑32, ↑37, ↑40
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023b. ↑32
Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, and Tommi Jaakkola. Dirichlet flow matching with applications to dna sequence design, 2024. ↑31
Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free schrödinger bridges via score and flow matching. arXiv preprint arXiv:2307.03672, 2023. ↑31
Angus Turner. Diffusion models as a kind of vae, June 2021. URL https://angusturner.github.io/generative_models/2021/06/29/diffusion-probabilistic-models-I.html. ↑33
Ludwig Winkler. Reverse time stochastic differential equations [for generative modeling], 2021. URL https://ludwigwinkler.github.io/blog/ReverseTimeAnderson/. ↑14
Ludwig Winkler. Fokker, planck, and ito, 2023. URL https://ludwigwinkler.github.io/blog/FokkerPlanck/. ↑41
Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257, 2023. ↑32
Chenyang Yuan. Diffusion models from scratch, from a new theoretical perspective, 2024. URL https://www.chenyang.co/diffusion.html. ↑36
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P. ↑32

$^{1}$ These stand for Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIM), following Ho et al. [2020] and Song et al. [2021].
$^{2}$ One benefit of using this particular forward process is computational: we can directly sample $x_{t}$ given $x_{0}$ in constant time.
$^{3}$ Formally, $p_{T}$ is close in KL divergence to $\mathcal{N}\left(0, T \sigma^{2}\right)$, assuming $p_{0}$ has bounded moments.
$^{12}$ This naturally suggests taking the continuous-time limit, which we discuss in Section 2.4, though it is not needed for most of our arguments.
$^{14}$ In practice, it is common to share parameters when learning the different regression functions $\left\{\mu_{t}\right\}_{t}$, instead of learning a separate function for each timestep independently. This is usually implemented by training a model $f_{\theta}$ that accepts the time $t$ as an additional argument, such that $f_{\theta}\left(x_{t}, t\right) \approx \mu_{t}\left(x_{t}\right)$.
$^{18}$ The chain rule for KL implies that we can add up these per-step errors: the approximation error for the final sample is bounded by the sum of all the per-step errors.
$^{21}$ See Eldan [2024] for a high-level overview of Brownian motions and Itô's formula. See also Evans [2012] for a gentle introductory textbook, and Kloeden and Platen [2011] for numerical methods.
$^{26}$ Because we can just "shift" our coordinates to make it so. Formally, our entire setup including Equation 35 is translation-symmetric.
$^{30}$ See Claim 5 in Appendix B.3 for an explicit statement.
$^{35}$ We add conditioning $x_{0}=a$, because we want to take expectations w.r.t. the two-point mixture distribution, not the single-point distribution.
$^{36}$ A proof sketch is in Appendix B.2. It involves rewriting the SDE noise term as the deterministic score (recall the connection between noise and score in equation (18)). Although it is deterministic, the score is unknown since it depends on $p_{t}$.
$^{37}$ To use a gas analogy: the SDE describes the (Brownian) motion of individual particles in a gas, while the PF-ODE describes the streamlines of the gas's velocity field. That is, the PF-ODE describes the motion of a "test particle" being transported by the gas, like a feather in the wind.
$^{51}$ Diffusion provides one possible construction, as we will see later in Section 4.6.
$^{55}$ See Appendix B.6 for details on why (70) and (71) are equivalent along their trajectories.
$^{57}$ In practice, linear flows are most often instantiated with the independent coupling, not the above "diffusion coupling." However, for large enough terminal variance $\sigma_{q}^{2}$, the diffusion coupling is close to independent. Therefore, Claim 4 tells us that the common practice in flow matching (linear flows with a Gaussian terminal distribution and independent coupling) is nearly equivalent to standard DDIM, with a different time schedule. Finally, for the experts: this is a claim about the "variance exploding" version of DDIM, which is what we use throughout. Claim 4 is false for variance-preserving DDIM.
  13. 57 57 ^(57){ }^{57}57 In practice, linear flows are most often instantiated with the independent coupling, not the above "diffusion coupling." However, for large enough terminal variance σ q 2 σ q 2 sigma_(q)^(2)\sigma_{q}^{2}σq2, the diffusion coupling is close to independent. Therefore, Claim 4 tells us that the common practice in flow matching (linear flows with a Gaussian terminal distribution and independent coupling) is nearly equivalent to standard DDIM, with a different time schedule. Finally, for the experts: this is a claim about the "variance exploding" version of DDIM, which is what we use throughout. Claim 4 is false for variance-preserving DDIM.