Diffusion Models


1. the problem being solved

you have a dataset. samples $x^{(1)}, \ldots, x^{(N)} \in \mathbb{R}^n$. they come from some unknown distribution $p_{\text{data}}(x)$ that lives on or near a low-dimensional manifold $\mathcal{K} \subset \mathbb{R}^n$.

you want to generate new samples from $p_{\text{data}}$. but you don’t know $p_{\text{data}}$ — you only have samples.

why is this hard? $p_{\text{data}}$ is multimodal, high-dimensional, and intractably complex. normalizing flows require tractable Jacobians. VAEs suffer from posterior collapse. GANs are notoriously unstable. diffusion models sidestep all of this by reformulating generation as iterative denoising.

the core insight: if you can learn to denoise, you can learn to generate.


2. the forward process — adding noise

2.1 the stochastic differential equation (SDE) view

the forward process takes a clean data point $x_0 \sim p_{\text{data}}$ and progressively destroys it by adding Gaussian noise. in continuous time, this is an Itô SDE:

\[dx_t = f(x_t, t)\, dt + g(t)\, dW_t\]

where $W_t$ is standard Brownian motion, $f$ is the drift (deterministic tendency), and $g(t)$ is the diffusion coefficient (noise intensity).

for the variance-exploding (VE) case (used in EDM, closer to the Yuan/Permenter setup):

\[f(x, t) = 0, \quad g(t) = \sigma(t)\sqrt{2\dot\sigma(t)/\sigma(t)} = \sqrt{2\dot\sigma \sigma}\]

which gives the marginal:

\[x_t = x_0 + \sigma(t)\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

so at time $t$, the noisy version is just $x_0$ plus Gaussian noise scaled by $\sigma(t)$. the noise schedule $\sigma(t)$ is a monotone increasing function: $\sigma(0) \approx 0$, $\sigma(T) \gg \text{diam}(\mathcal{K})$.

2.2 what does the marginal distribution look like?

since $x_t = x_0 + \sigma_t \epsilon$ with $x_0 \sim p_{\text{data}}$, the marginal density is a convolution:

\[p_t(x) = \int p_{\text{data}}(x_0)\, \mathcal{N}(x; x_0, \sigma_t^2 I)\, dx_0\]

this is the data distribution blurred by a Gaussian kernel of width $\sigma_t$. as $\sigma_t \to \infty$, this approaches $\mathcal{N}(0, \sigma_t^2 I)$ regardless of $p_{\text{data}}$ — pure noise obliterates all structure.

physically: imagine a probability cloud over $\mathcal{K}$. at $t=0$ the cloud tightly wraps $\mathcal{K}$. as $t$ increases, the cloud expands and diffuses until it’s just a spherical Gaussian. generation = run this process in reverse.

2.3 why Gaussian noise specifically?

two reasons. first, Gaussians are isotropic — they don’t prefer any direction, so the noise is unbiased. second, the convolution \(p_t = p_{\text{data}} * \mathcal{N}(0, \sigma_t^2 I)\) is differentiable in $x$ for all $t > 0$ even if $p_{\text{data}}$ has delta-function support (discrete data). this smoothness is what makes score functions well-defined everywhere.
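the blurring story is easy to check numerically. a minimal sketch (the toy circle dataset and the σ values are my choices, not from the text):

```python
import torch

torch.manual_seed(0)

# toy dataset: 1000 points on the unit circle, a 1-d manifold K in R^2
theta = torch.rand(1000) * 2 * torch.pi
x0 = torch.stack([theta.cos(), theta.sin()], dim=1)       # (N, 2)

for sigma in (0.05, 0.5, 5.0):
    eps = torch.randn_like(x0)
    xt = x0 + sigma * eps            # forward process: x_t = x_0 + sigma * eps
    # per-coordinate variance of x0 is 1/2, so Var[x_t] = 1/2 + sigma^2;
    # at sigma = 5 the circle contributes almost nothing: p_t ≈ N(0, sigma^2 I)
    print(f"sigma={sigma}: sample std = {xt.std():.3f}, "
          f"predicted = {(0.5 + sigma**2) ** 0.5:.3f}")
```

at σ = 0.05 the samples visibly hug the circle; at σ = 5 the empirical standard deviation matches the pure-Gaussian prediction and the manifold structure is gone.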

INTERACTIVE — FORWARD PROCESS: THREE NOISE LEVELS SIDE BY SIDE


3. the score function — what it means geometrically

3.1 definition

the score function at noise level $\sigma$ is:

\[s_\sigma(x) := \nabla_x \log p_\sigma(x)\]

where $p_\sigma$ is the marginal density of $x = x_0 + \sigma\epsilon$. the gradient is with respect to $x$.

what does this mean? $p_\sigma(x)$ is a probability density over $\mathbb{R}^n$. $\log p_\sigma(x)$ is the log-density — a scalar field. $\nabla_x \log p_\sigma(x)$ is a vector field that points uphill on the log-density — toward regions of higher probability.

geometrically: the score points from $x$ toward the nearest high-density region of $p_\sigma$. if you’re in a low-probability region (like pure noise), the score points you toward the data manifold $\mathcal{K}$. if you follow the score (Langevin dynamics), you drift toward $p_\sigma$.

3.2 score ↔ denoiser: Tweedie’s formula

here is the central identity that underlies all of diffusion. for $x_\sigma = x_0 + \sigma\epsilon$ where $x_0 \sim p_{\text{data}}$:

\[\boxed{\mathbb{E}[x_0 \mid x_\sigma] = x_\sigma + \sigma^2 \nabla_{x_\sigma} \log p_\sigma(x_\sigma)}\]

this is Tweedie’s formula (Maurice Tweedie, 1956). let’s derive it.

derivation. the joint density factors as:

\[p(x_0, x_\sigma) = p_{\text{data}}(x_0)\, \mathcal{N}(x_\sigma; x_0, \sigma^2 I)\]

the marginal is:

\[p_\sigma(x_\sigma) = \int p_{\text{data}}(x_0)\, \mathcal{N}(x_\sigma; x_0, \sigma^2 I)\, dx_0\]

take $\nabla_{x_\sigma}$ of both sides:

\[\nabla_{x_\sigma} p_\sigma(x_\sigma) = \int p_{\text{data}}(x_0)\, \nabla_{x_\sigma}\mathcal{N}(x_\sigma; x_0, \sigma^2 I)\, dx_0\]

now use the fact that for Gaussians:

\[\nabla_{x_\sigma} \mathcal{N}(x_\sigma; x_0, \sigma^2 I) = \mathcal{N}(x_\sigma; x_0, \sigma^2 I)\, \frac{x_0 - x_\sigma}{\sigma^2}\]

(this is just the gradient of $-\frac{|x_\sigma - x_0|^2}{2\sigma^2}$ with respect to $x_\sigma$). substitute:

\[\nabla_{x_\sigma} p_\sigma = \int p_{\text{data}}(x_0)\, \mathcal{N}(x_\sigma; x_0, \sigma^2 I)\, \frac{x_0 - x_\sigma}{\sigma^2}\, dx_0\]

divide both sides by $p_\sigma(x_\sigma)$ and use $p(x_0 \mid x_\sigma) = p_{\text{data}}(x_0)\mathcal{N}(x_\sigma; x_0, \sigma^2 I) / p_\sigma(x_\sigma)$:

\[\nabla_{x_\sigma} \log p_\sigma = \frac{1}{\sigma^2}\left(\int x_0\, p(x_0 \mid x_\sigma)\, dx_0 - x_\sigma\right) = \frac{\mathbb{E}[x_0 \mid x_\sigma] - x_\sigma}{\sigma^2}\]

rearranging: $\mathbb{E}[x_0 \mid x_\sigma] = x_\sigma + \sigma^2 \nabla \log p_\sigma(x_\sigma)$.

what this says: the score at a noisy point $x_\sigma$ tells you how to shift $x_\sigma$ to get the posterior mean of the clean data. the shift is $\sigma^2 \cdot s_\sigma(x_\sigma)$ — proportional to $\sigma^2$.

connection to denoising: define \(D_\sigma(x) := x + \sigma^2 \nabla \log p_\sigma(x) = \mathbb{E}[x_0 \mid x_\sigma = x]\). this is the optimal denoiser — it minimizes mean squared error $\mathbb{E}[\|D_\sigma(x_\sigma) - x_0\|^2]$. training a neural network to predict $x_0$ from $x_\sigma = x + \sigma\epsilon$ gives you access to the score via:

\[\nabla \log p_\sigma(x) = \frac{D_\sigma(x) - x}{\sigma^2}\]

equivalently, train to predict the noise $\epsilon_\theta(x_\sigma, \sigma) \approx \epsilon$, then:

\[\nabla \log p_\sigma(x) \approx -\frac{\epsilon_\theta(x, \sigma)}{\sigma}\]

(since $\epsilon = (x_\sigma - x_0)/\sigma$ and $D_\sigma(x) = x - \sigma\epsilon$).
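Tweedie's formula is easy to sanity-check numerically: get the score by autograd on the closed-form log-marginal of a tiny dataset, and compare against the posterior mean computed directly as a softmax-weighted average. a sketch (the two-point dataset and σ are arbitrary choices):

```python
import torch

data = torch.tensor([[-1.0, 0.0], [1.0, 0.0]])   # K = two points in R^2
sigma = 0.8

def log_p_sigma(x):
    # log of p_sigma(x) = (1/|K|) sum_i N(x; x_i, sigma^2 I),
    # dropping x-independent normalization constants
    sq = ((x[None, :] - data) ** 2).sum(dim=1)
    return torch.logsumexp(-sq / (2 * sigma**2), dim=0)

x = torch.tensor([0.3, 0.5], requires_grad=True)
score = torch.autograd.grad(log_p_sigma(x), x)[0]

# posterior mean E[x_0 | x_sigma] computed directly: softmax-weighted data mean
with torch.no_grad():
    sq = ((x[None, :] - data) ** 2).sum(dim=1)
    w = torch.softmax(-sq / (2 * sigma**2), dim=0)
    post_mean = (w[:, None] * data).sum(dim=0)

# Tweedie: E[x_0 | x_sigma] = x_sigma + sigma^2 * score
tweedie = x.detach() + sigma**2 * score
print(tweedie, post_mean)   # the two should agree
```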

INTERACTIVE — IDEAL DENOISER VECTOR FIELD (exact softmax-weighted mean)

Connection: score s_σ(x) = −ε*(x,σ)/σ = (x̂₀ − x)/σ². All three vector fields encode the same information — just rescaled. Training ε_θ to predict noise ≡ training to approximate the score.

3.3 score matching loss derivation

you can’t compute $\nabla \log p_\sigma(x)$ directly — you don’t know $p_\sigma$. but you can train a neural network $s_\theta(x, \sigma) \approx \nabla \log p_\sigma(x)$ by minimizing:

\[\mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{x \sim p_\sigma}\left[\|s_\theta(x, \sigma) - \nabla_x \log p_\sigma(x)\|^2\right]\]

this requires $\nabla_x \log p_\sigma(x)$ — still unknown. but here’s the trick: by Stein’s identity, this loss equals (up to a constant independent of $\theta$):

\[\mathcal{L}_{\text{ISM}}(\theta) = \mathbb{E}\left[\text{tr}(\nabla_x s_\theta(x,\sigma)) + \frac{1}{2}\|s_\theta(x,\sigma)\|^2\right]\]

this is implicit score matching (Hyvärinen 2005). no $p_\sigma$ needed! but the trace term is expensive: computing $\text{tr}(\nabla_x s_\theta)$ in $n$ dimensions costs $O(n)$ backprop passes.

denoising score matching (Vincent 2011) is cheaper. it uses the conditional form: since \(p_\sigma(x) = \int p_{\text{data}}(x_0) p(x \mid x_0)\, dx_0\) and by Tweedie:

\[\nabla_{x} \log p_\sigma(x) = \frac{\mathbb{E}[x_0 \mid x] - x}{\sigma^2}\]

the score matching loss reduces to:

\[\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{x_0, \epsilon}\left[\left\|\epsilon_\theta(x_0 + \sigma\epsilon, \sigma) - \epsilon\right\|^2\right]\]

this is just: given noisy data, predict the noise. beautifully simple, and provably equivalent to score matching. this is exactly the diffusion training objective.
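a closed-form special case makes the equivalence concrete. for $p_{\text{data}} = \mathcal{N}(0, I)$ the marginal is $p_\sigma = \mathcal{N}(0, (1+\sigma^2)I)$, so the true score is $-x/(1+\sigma^2)$, and the DSM-optimal noise predictor is $\mathbb{E}[\epsilon \mid x_\sigma] = \sigma x_\sigma/(1+\sigma^2)$, so $-\mathbb{E}[\epsilon \mid x_\sigma]/\sigma$ recovers the score exactly. a sketch checking this by least squares (everything here is jointly Gaussian, so the optimal predictor is linear):

```python
import torch

torch.manual_seed(0)
sigma, N = 0.8, 200_000
x0  = torch.randn(N)                 # p_data = N(0, 1), scalar case
eps = torch.randn(N)
x   = x0 + sigma * eps               # noisy samples, marginal N(0, 1 + sigma^2)

# best predictor of eps from x is E[eps | x] = c * x; the least-squares
# slope estimates c = Cov(eps, x) / Var(x) = sigma / (1 + sigma^2)
c_hat = (x * eps).mean() / (x * x).mean()
c_true = sigma / (1 + sigma**2)
print(c_hat.item(), c_true)          # should be close

# and -E[eps|x]/sigma = -x/(1+sigma^2), exactly the score of N(0, (1+sigma^2))
```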


4. the optimization geometry — denoising as projection

this is the Yuan/Permenter (ICML 2024) contribution: a purely geometric interpretation that doesn’t need the SDE/score framework.

4.1 distance functions and the data manifold

for the data set $\mathcal{K}$ (treated as a finite set or compact manifold), define:

distance function: \(\text{dist}_\mathcal{K}(x) = \min_{x_0 \in \mathcal{K}} \|x - x_0\|\)

projection: \(\text{proj}_\mathcal{K}(x) = \arg\min_{x_0 \in \mathcal{K}} \|x - x_0\|\)

key fact: \(\nabla_x \frac{1}{2}\text{dist}_\mathcal{K}^2(x) = x - \text{proj}_\mathcal{K}(x)\)

proof: let $x^* = \text{proj}_\mathcal{K}(x)$. then $\frac{1}{2}\text{dist}^2(x) = \frac{1}{2}\|x - x^*\|^2$. by the envelope theorem (since $x^*$ is the minimizer, the term from differentiating $x^*(x)$ vanishes):

\[\nabla_x \frac{1}{2}\|x-x^*(x)\|^2 = x - x^*\]

so the gradient of the squared distance points away from the data — gradient descent on $\frac{1}{2}\text{dist}^2$ moves toward $\mathcal{K}$.

problem: $\text{dist}_\mathcal{K}$ is non-differentiable at equidistant points (e.g., if $x$ is equidistant from two data points on a spiral, the gradient is undefined — kink). can’t train on this.

4.2 the softmin distance function

smooth it. replace $\min$ with a softmin (a scaled log-sum-exp, the standard smooth approximation to $\min$):

\[\text{dist}^2_\mathcal{K}(x, \sigma) := -2\sigma^2 \log\!\left(\sum_{x_0 \in \mathcal{K}} \exp\!\left(-\frac{\|x_0 - x\|^2}{2\sigma^2}\right)\right)\]

the prefactor $2\sigma^2$ matches the $2\sigma^2$ inside the exponent, so both limits come out clean. as $\sigma \to 0$: the sum is dominated by the $x_0$ closest to $x$, so \(\text{dist}^2_\mathcal{K}(x, \sigma) \to \text{dist}^2_\mathcal{K}(x)\) (hard distance). as $\sigma \to \infty$: $\exp(-\|x_0-x\|^2/2\sigma^2) \approx 1 - \|x_0-x\|^2/2\sigma^2$ for all $x_0$, so

\[\text{dist}^2_\mathcal{K}(x, \sigma) \approx -2\sigma^2 \log \lvert\mathcal{K}\rvert + \frac{1}{\lvert\mathcal{K}\rvert}\sum_{x_0}\|x_0 - x\|^2\]

the first term is constant in $x$; what remains is the mean squared distance to the data. all geometry except the centroid is washed out.

take the gradient:

\[\nabla_x \text{dist}^2_\mathcal{K}(x, \sigma) = 2 \cdot \frac{\sum_{x_0} (x - x_0)\exp(-\|x-x_0\|^2/2\sigma^2)}{\sum_{x_0} \exp(-\|x-x_0\|^2/2\sigma^2)}\] \[= 2\sum_{x_0} w_{x_0}(x, \sigma)\,(x - x_0), \quad w_{x_0} = \frac{\exp(-\|x-x_0\|^2/2\sigma^2)}{\sum_{x_0'}\exp(-\|x-x_0'\|^2/2\sigma^2)}\]

this is $2(x - \bar{x})$ where $\bar{x} = \sum_{x_0} w_{x_0} x_0$ is a softmax-weighted mean of the data points, weighted by proximity to $x$.

4.3 the key theorem: denoiser = gradient of smoothed distance

now compare with the ideal denoiser formula from before. for discrete uniform $\mathcal{K}$:

\[\epsilon^*(x, \sigma) = \frac{\sum_{x_0} (x - x_0)\exp(-\|x-x_0\|^2/2\sigma^2)}{\sigma \sum_{x_0}\exp(-\|x-x_0\|^2/2\sigma^2)}\]

comparing with $\nabla_x \text{dist}^2_\mathcal{K}(x,\sigma)$:

\[\boxed{\frac{1}{2}\nabla_x \text{dist}^2_\mathcal{K}(x,\sigma) = \sigma\, \epsilon^*(x,\sigma)}\]

the ideal denoiser (which the network $\epsilon_\theta$ approximates) is the gradient of the smoothed squared distance to $\mathcal{K}$, up to a factor of $\sigma$.

two perspectives, same object:

  • probabilistic: the denoiser estimates the score $\nabla \log p_\sigma$, which points toward high-density regions
  • geometric: the denoiser estimates $\nabla \text{dist}^2_\mathcal{K}$, which points toward $\mathcal{K}$

these aren’t just analogous — they’re the same vector field.

for discrete data, \(p_\sigma(x) \propto \exp(-\text{dist}^2_\mathcal{K}(x,\sigma)/2\sigma^2)\) (a Gibbs distribution), so \(\nabla \log p_\sigma = -\nabla \text{dist}^2_\mathcal{K}/2\sigma^2\), consistent with the score picture.
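the boxed identity can be verified with autograd: differentiate the log-sum-exp distance (written with the $2\sigma^2$ prefactor, the normalization that makes the identity exact) and compare against the softmax-weighted ideal denoiser. the dataset and σ here are arbitrary:

```python
import torch

torch.manual_seed(1)
data = torch.randn(8, 2)        # a small discrete dataset K in R^2
sigma = 0.7
x = torch.randn(2, requires_grad=True)

# smoothed squared distance: -2 sigma^2 * logsumexp(-||x - x0||^2 / (2 sigma^2))
sq = ((x[None, :] - data) ** 2).sum(dim=1)
dist2 = -2 * sigma**2 * torch.logsumexp(-sq / (2 * sigma**2), dim=0)
grad = torch.autograd.grad(dist2, x)[0]

# ideal denoiser noise prediction: eps*(x, sigma) = (x - xbar) / sigma
with torch.no_grad():
    w = torch.softmax(-sq / (2 * sigma**2), dim=0)
    xbar = (w[:, None] * data).sum(dim=0)
    eps_star = (x.detach() - xbar) / sigma

print(0.5 * grad, sigma * eps_star)   # the same vector
```

(the $1/\lvert\mathcal{K}\rvert$ normalization is dropped inside the log since it only shifts $\text{dist}^2$ by a constant and doesn't affect the gradient.)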

INTERACTIVE — SOFTMIN DISTANCE FUNCTION CONTOURS


5. DDIM sampling — derived from first principles

5.1 the setup

given a trained denoiser $\epsilon_\theta(x, \sigma) \approx \epsilon^*(x, \sigma)$, we want to generate samples $x_0$ on (or near) $\mathcal{K}$ by starting from noise and iteratively removing it.

why not one step? at high $\sigma_T$, the denoiser gives $\hat{x}_0 = x_T - \sigma_T\,\epsilon_\theta(x_T, \sigma_T) \approx \bar{x}$ (the mean of the data). small relative error, huge absolute error — nowhere near any specific data point. need to iterate.

the plan: construct a sequence $x_T, x_{T-1}, \ldots, x_0$ where $\text{dist}_\mathcal{K}(x_t)$ decreases at each step, controlled by a schedule $\sigma_T > \sigma_{T-1} > \cdots > \sigma_0 \approx 0$.

5.2 DDIM as gradient descent

consider gradient descent on $f(x) = \frac{1}{2}\text{dist}^2_\mathcal{K}(x)$:

\[x_{t-1} = x_t - \alpha_t \nabla f(x_t)\]

we don’t have $\nabla f(x_t)$ exactly, but we have

\[\epsilon_\theta(x_t, \sigma_t) \approx \frac{1}{2\sigma_t}\nabla_x \text{dist}^2_\mathcal{K}(x_t, \sigma_t) \approx \frac{\text{dist}_\mathcal{K}(x_t)}{\sigma_t}\nabla \text{dist}_\mathcal{K}(x_t)\]

(using the result that $\nabla \text{dist}^2 = 2\,\text{dist}\,\nabla\text{dist}$, and the approximation that \(\text{dist}^2_\mathcal{K}(x, \sigma) \approx \text{dist}^2_\mathcal{K}(x)\) when $\sigma \ll \text{dist}_\mathcal{K}(x)$).

the DDIM update is:

\[x_{t-1} = x_t - (\sigma_t - \sigma_{t-1})\,\epsilon_\theta(x_t, \sigma_t)\]

why this stepsize? substituting the estimate above, the DDIM update becomes

\[x_{t-1} = x_t - \frac{\sigma_t - \sigma_{t-1}}{\sigma_t}\,\text{dist}_\mathcal{K}(x_t)\,\nabla \text{dist}_\mathcal{K}(x_t)\]

i.e. gradient descent on $\frac{1}{2}\text{dist}^2_\mathcal{K}$ with relative stepsize $(\sigma_t - \sigma_{t-1})/\sigma_t$. since $\|\nabla \text{dist}_\mathcal{K}\| = 1$ and the step moves along the segment toward the projection, it contracts the distance by that same factor:

\[\text{dist}_\mathcal{K}(x_{t-1}) \approx \left(1 - \frac{\sigma_t - \sigma_{t-1}}{\sigma_t}\right)\text{dist}_\mathcal{K}(x_t) = \frac{\sigma_{t-1}}{\sigma_t}\,\text{dist}_\mathcal{K}(x_t)\]

the distance shrinks at exactly the rate of the schedule, preserving the invariant \(\text{dist}_\mathcal{K}(x_t) \approx \sigma_t\sqrt{n}\). for a log-linear schedule (\(\sigma_{t-1}/\sigma_t = r\) constant): a constant relative stepsize of \(1-r\) at every step. good convergence property.

5.3 the convergence theorem

relative error model: assume the learned \(\epsilon_\theta\) satisfies, when \(\sigma \approx \text{dist}_\mathcal{K}(x)/\sqrt{n}\):

\[\|x - \sigma\,\epsilon_\theta(x,\sigma) - \text{proj}_\mathcal{K}(x)\| \leq \eta\,\text{dist}_\mathcal{K}(x)\]

for some \(\eta \in [0, 1)\). the predicted clean point \(\hat{x}_0 = x - \sigma\epsilon_\theta\) lies within \(\eta \cdot \text{dist}_\mathcal{K}(x)\) of the true projection.

admissible schedule: the schedule \(\{\sigma_t\}\) is admissible if \(\sigma_t \approx \text{dist}_\mathcal{K}(x_t)/\sqrt{n}\) throughout sampling. a geometrically decreasing schedule (log-linear) is admissible by induction: if it holds at step \(t\), the gradient descent step reduces \(\text{dist}_\mathcal{K}(x_t)\) by a controlled factor that matches \(\sigma_{t-1}/\sigma_t\).

theorem: under these assumptions, \(\text{dist}_\mathcal{K}(x_t)/\sqrt{n} \approx \sigma_t\) for all \(t\), and \(\text{dist}_\mathcal{K}(x_0) \to 0\) as the schedule terminates. the algorithm converges to \(\mathcal{K}\).
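the theorem is easiest to see with \(\lvert\mathcal{K}\rvert = 1\), where the ideal denoiser is exact and every quantity is in closed form. a sketch (schedule endpoints arbitrary):

```python
import torch

torch.manual_seed(0)

# single data point: eps*(x, sigma) = (x - x0)/sigma exactly, and the DDIM
# update x_{t-1} = x_t - (1 - sigma_{t-1}/sigma_t)(x_t - x0) contracts the
# distance to x0 by exactly sigma_{t-1}/sigma_t per step
x0 = torch.tensor([2.0, -1.0])
sigmas = torch.logspace(1, -2, 20)         # descending log-linear schedule
xt = torch.randn(2) * sigmas[0]

for sig, sig_prev in zip(sigmas[:-1], sigmas[1:]):
    eps = (xt - x0) / sig                  # ideal denoiser for |K| = 1
    xt = xt - (sig - sig_prev) * eps       # DDIM step

print((xt - x0).norm().item())             # ~ initial distance x 1e-3, i.e. tiny
```

the final distance is the initial distance times \(\sigma_{\min}/\sigma_{\max}\): the iterate tracks the schedule exactly, which is the \(\eta = 0\) case of the theorem.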

5.4 DDIM ↔ SDE perspective equivalence

in the SDE view, the reverse process (Anderson 1982) for a forward SDE \(dx = f(x,t)dt + g(t)dW\) is:

\[d\overleftarrow{x} = \left[f(\overleftarrow{x}, t) - g(t)^2 \nabla_x \log p_t(\overleftarrow{x})\right] dt + g(t)\, d\bar{W}\]

for the VE-SDE (\(f=0\), \(g = \sqrt{2\dot\sigma\sigma}\)), the deterministic counterpart is the probability flow ODE, which replaces the noise term with an extra \(\frac{1}{2}g^2\nabla\log p_t\) of drift and shares all marginals \(p_t\) with the SDE:

\[\frac{dx}{dt} = -\frac{1}{2}g(t)^2 \nabla_x \log p_t(x) = -\dot\sigma\sigma \cdot \frac{-\epsilon^*(x,\sigma)}{\sigma} = \dot\sigma\,\epsilon^*(x,\sigma)\]

discretizing with an Euler step from \(t\) to \(t-1\) (so \(\Delta\sigma = \sigma_{t-1} - \sigma_t\)):

\[x_{t-1} - x_t \approx (\sigma_{t-1} - \sigma_t)\,\epsilon^*, \quad\text{i.e.}\quad x_{t-1} = x_t - (\sigma_t - \sigma_{t-1})\,\epsilon^*(x_t, \sigma_t)\]

this is exactly DDIM. the two derivations agree: DDIM is the probability flow ODE for the VE-SDE, discretized with Euler steps.

INTERACTIVE — SAMPLING TRAJECTORIES (8 independent runs from pure noise)


6. the manifold hypothesis and low noise geometry

6.1 why denoising ≈ projection in the low noise regime

suppose \(\mathcal{K}\) is a smooth \(d\)-dimensional manifold in \(\mathbb{R}^n\) with \(d \ll n\).

at a point \(x \in \mathcal{K}\), the tangent space \(T_x\mathcal{K}\) is \(d\)-dimensional. the normal space \(N_x\mathcal{K}\) is \((n-d)\)-dimensional.

add noise \(x_\sigma = x + \sigma\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I_n)\). decompose:

\[\epsilon = \epsilon_\parallel + \epsilon_\perp\]

where \(\epsilon_\parallel \in T_x\mathcal{K}\) (tangent) and \(\epsilon_\perp \in N_x\mathcal{K}\) (normal). then:

\[\|\epsilon_\parallel\|^2 \sim \chi^2(d), \quad \|\epsilon_\perp\|^2 \sim \chi^2(n-d)\]

expected values: \(\mathbb{E}\|\epsilon_\parallel\|^2 = d\), \(\mathbb{E}\|\epsilon_\perp\|^2 = n-d\).

the key ratio: \(\|\epsilon_\perp\|/\|\epsilon_\parallel\| \approx \sqrt{(n-d)/d} = \sqrt{n/d - 1} \gg 1\) when \(n \gg d\).

so for a high-dimensional space with a low-dimensional manifold, almost all the noise is in the normal direction. the denoiser needs to remove mostly normal-space noise. removing normal noise = projecting toward the manifold. this is why denoising ≈ projection when \(\sigma\) is small relative to \(\text{dist}_\mathcal{K}\).

more precisely: the optimal denoiser for \(x_\sigma = x_0 + \sigma\epsilon\) near \(\mathcal{K}\) sends \(x_\sigma \mapsto x_\sigma - \sigma\epsilon \approx x_\sigma - \sigma\epsilon_\perp = x_0 + \sigma\epsilon_\parallel\). this is a point on (or close to) \(\mathcal{K}\) in the normal direction from \(x_\sigma\), which is approximately \(\text{proj}_\mathcal{K}(x_\sigma)\). the tangent noise \(\sigma\epsilon_\parallel\) causes residual error of order \(\sigma\sqrt{d}\) — small when \(\sigma\) is small.
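the χ² split is a two-line simulation. taking the tangent space to be the first \(d\) coordinates is without loss of generality, by rotational symmetry of the Gaussian (the \(n\), \(d\), and sample count here mirror the interactive demo but are otherwise arbitrary):

```python
import torch

torch.manual_seed(0)
n, d, trials = 20, 2, 500
eps = torch.randn(trials, n)

# split each noise draw into tangent (first d coords) and normal parts
sq_tan  = (eps[:, :d] ** 2).sum(dim=1)    # ~ chi^2(d),     mean d
sq_norm = (eps[:, d:] ** 2).sum(dim=1)    # ~ chi^2(n - d), mean n - d

print(sq_tan.mean().item(), sq_norm.mean().item())   # ~ 2 and ~ 18
```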

INTERACTIVE — TANGENT vs. NORMAL NOISE FRACTIONS (500 samples of ε ~ N(0,I_n))


6.2 why the relative error model holds at high noise

when \(\sigma \gg \text{diam}(\mathcal{K})\): all data points \(x_0\) are roughly equidistant from \(x\). the softmax weights \(w_{x_0} \approx 1/\lvert\mathcal{K}\rvert\) become uniform. so:

\[\hat{x}_0 = x - \sigma\epsilon^*(x,\sigma) \approx x - \frac{x - \bar{x}_\mathcal{K}}{\sigma}\sigma = \bar{x}_\mathcal{K}\]

the predicted \(x_0\) is close to the mean \(\bar{x}_\mathcal{K}\). the true projection \(\text{proj}_\mathcal{K}(x)\) is some point on \(\mathcal{K}\). since they’re both in \(\mathcal{K}\) (approximately), and \(\|x - \mathcal{K}\| \approx \|x\| \approx \sigma\sqrt{n}\) is large:

\[\|\hat{x}_0 - \text{proj}_\mathcal{K}(x)\| \leq \text{diam}(\mathcal{K}) \ll \sigma\sqrt{n} \approx \text{dist}_\mathcal{K}(x)\]

so relative error \(\eta \approx \text{diam}(\mathcal{K}) / (\sigma\sqrt{n}) \to 0\) as \(\sigma \to \infty\). the relative error model holds trivially at high noise.
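the uniform-weights collapse is easy to see with the exact softmax denoiser at a σ far above the data diameter (dataset and σ here are arbitrary):

```python
import torch

torch.manual_seed(0)
data = torch.randn(16, 2) * 0.5          # dataset with diam(K) ~ 1
sigma = 100.0                            # sigma >> diam(K)
x = torch.randn(2) * sigma               # a typical high-noise sample

sq = ((x[None, :] - data) ** 2).sum(dim=1)
w = torch.softmax(-sq / (2 * sigma**2), dim=0)   # nearly uniform weights
x0_hat = (w[:, None] * data).sum(dim=0)          # = x - sigma * eps*(x, sigma)

print(x0_hat, data.mean(dim=0))          # nearly identical
```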


7. improved sampling: gradient estimation and momentum

7.1 why DDIM makes errors

DDIM uses \(\epsilon_\theta(x_t, \sigma_t)\) as the gradient estimate at each step. but \(\epsilon_\theta(x_t, \sigma_t)\) approximates \(\nabla \text{dist}^2_\mathcal{K}(x_t, \sigma_t)/(2\sigma_t)\) — the gradient of the smoothed distance, not the exact distance. when \(\sigma_t\) is large, the smoothed gradient points toward the weighted mean of the data, not toward the closest specific point.

as \(x_t\) moves toward \(\mathcal{K}\) during sampling, the gradient estimate computed at step \(t+1\) remains informative: \(\nabla \text{dist}_\mathcal{K}(x)\) is constant along the segment from \(x\) to \(\text{proj}_\mathcal{K}(x)\) (the direction to the projection doesn't change as you move along it), so the previous estimate is still approximately valid.

7.2 the gradient estimation update

blend current and previous estimates:

\[\bar\epsilon_t = \gamma\,\epsilon_\theta(x_t, \sigma_t) + (1-\gamma)\,\epsilon_\theta(x_{t+1}, \sigma_{t+1})\]

for \(\gamma = 1\): pure DDIM. for \(\gamma = 2\): extrapolation — overshoot in the gradient direction to compensate for accumulated error. this is equivalent to momentum in optimization:

\[\bar\epsilon_t = \epsilon_\theta(x_t, \sigma_t) + [\epsilon_\theta(x_t, \sigma_t) - \epsilon_\theta(x_{t+1}, \sigma_{t+1})]\]

the bracket is the change in the gradient estimate — like the first-order correction in Heun’s method (the predictor-corrector ODE solver). in fact, the gradient estimation sampler is Heun’s method applied to the probability flow ODE.

7.3 stochastic noise injection and DDPM recovery

deterministic samplers (DDIM) can produce low-diversity outputs or get stuck in narrow modes. adding stochastic noise injects exploration:

\[x_{t-1} = x_t - (\sigma_t - \sigma_{t'})\,\bar\epsilon_t + \eta\, w_t, \quad w_t \sim \mathcal{N}(0, I)\]

the parameters must be chosen so the noise level remains consistent with the schedule. requiring that \(\mathbb{E}\|x_{t-1}\|^2\) evolves correctly (so \(x_{t-1}\) has effective noise level \(\sigma_{t-1}\)):

we need \(\sigma_{t-1}^2 = \sigma_{t'}^2 + \eta^2\), which gives \(\eta = \sqrt{\sigma_{t-1}^2 - \sigma_{t'}^2}\).

and \(\sigma_{t-1} = \sigma_t^\mu \sigma_{t'}^{1-\mu}\) (geometric interpolation between \(\sigma_t\) and \(\sigma_{t'}\)).

check \(\mu = 1/2\): \(\sigma_{t-1} = \sqrt{\sigma_t \sigma_{t'}}\), \(\sigma_{t'} = \sigma_{t-1}^2/\sigma_t\), \(\eta = \sqrt{\sigma_{t-1}^2 - \sigma_{t'}^2} = \sigma_{t-1}\sqrt{1 - \sigma_{t-1}^2/\sigma_t^2}\). this is exactly the DDPM noise coefficient derived from the Markov chain posterior. DDPM recovered. ✓
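the \(\mu = 1/2\) algebra can be checked mechanically (the σ values are arbitrary):

```python
import math

# an arbitrary adjacent pair from a hypothetical schedule
sigma_t, mu = 2.0, 0.5
sigma_prev = 0.8 * sigma_t                     # sigma_{t-1}

# geometric interpolation sigma_{t-1} = sigma_t^mu * sigma_t'^(1-mu),
# solved for the intermediate level sigma_t'
sigma_p = (sigma_prev / sigma_t**mu) ** (1 / (1 - mu))
eta = math.sqrt(sigma_prev**2 - sigma_p**2)

# mu = 1/2 closed forms from the text
assert abs(sigma_p - sigma_prev**2 / sigma_t) < 1e-12
assert abs(eta - sigma_prev * math.sqrt(1 - sigma_prev**2 / sigma_t**2)) < 1e-12
print(sigma_p, eta)
```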


8. scale invariance and the log-linear schedule

8.1 why log-linear is special

a log-linear schedule has \(\sigma_t = \sigma_T \cdot r^{T-t}\) for some ratio \(r < 1\). equivalently: \(\log \sigma_t\) is linear in \(t\).

scale invariance: gradient descent on \(\frac{1}{2}\text{dist}^2\) with relative stepsize \((\sigma_t - \sigma_{t-1})/\sigma_t = 1 - r\) (constant!) has the property that if \(x_t\) is at relative distance \(\delta_t = \text{dist}_\mathcal{K}(x_t)/(\sigma_t\sqrt{n})\) from the manifold, then:

\[\delta_{t-1} \approx \delta_t \cdot (1 - (1-\eta)(\sigma_t - \sigma_{t-1})/\sigma_t) = \delta_t \cdot (1 - (1-\eta)(1-r))\]

so \(\delta_t\) decays geometrically: \(\delta_t \leq \delta_T \cdot (1 - (1-\eta)(1-r))^{T-t}\). the admissibility invariant \(\delta_t \approx 1\) holds because \(\sigma_t\) decreases at the same geometric rate as \(\text{dist}_\mathcal{K}(x_t)\).

8.2 connection to annealing in statistical physics

the forward process is like simulated annealing in reverse: start at high temperature (\(\sigma_T \to \infty\), Boltzmann distribution \(p_\sigma \propto \exp(-\text{dist}^2_\mathcal{K}(x,\sigma)/2\sigma^2)\) is flat), cool down slowly (decrease \(\sigma\)), and the system “freezes” into a low-energy state (a sample from \(p_{\text{data}}\)).

the Gibbs distribution at temperature \(T\): \(p \propto e^{-E/T}\). identify \(T \leftrightarrow \sigma^2\) and \(E \leftrightarrow \frac{1}{2}\text{dist}^2_\mathcal{K}(x,\sigma)\). large \(\sigma\) = high temperature = near-uniform distribution. small \(\sigma\) = low temperature = concentrated on \(\mathcal{K}\) (low energy states).

DDIM = annealing with zero temperature (deterministic). DDPM = annealing with controlled noise injection (Langevin dynamics at each temperature). the optimal annealing rate corresponds to the admissible schedule.

SCHEDULE COMPARISON — log-linear · DDPM-cosine · EDM/Karras


9. the noise prediction parametrization — and why it works

9.1 three equivalent parametrizations

the network can predict any of these, and they’re all equivalent via simple transformations:

  • noise (\(\epsilon\)-pred): predict \(\epsilon_\theta(x, \sigma) \approx \epsilon\); recover \(\hat{x}_0 = x - \sigma\epsilon_\theta\) and \(\nabla\log p_\sigma \approx -\epsilon_\theta/\sigma\)
  • data (\(x_0\)-pred): predict \(D_\theta(x,\sigma) \approx x_0\); recover \(\hat{x}_0 = D_\theta\) and \(\nabla\log p_\sigma \approx (D_\theta - x)/\sigma^2\)
  • score (\(s\)-pred): predict \(s_\theta(x,\sigma) \approx \nabla\log p_\sigma\); recover \(\hat{x}_0 = x + \sigma^2 s_\theta\) and \(\nabla\log p_\sigma \approx s_\theta\)

all three minimize the same objective (up to constant multipliers). the relationship: \(s_\theta = -\epsilon_\theta/\sigma = (D_\theta - x)/\sigma^2\).
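the conversions collapse to one-liners. starting from a perfect ε-prediction on synthetic tensors, the other two parametrizations recover the same \(x_0\) and score (a sketch):

```python
import torch

torch.manual_seed(0)
sigma = 0.5
x0 = torch.randn(4, 2)                    # pretend clean data
eps = torch.randn_like(x0)
x = x0 + sigma * eps                      # noisy input

# suppose the network predicted the noise exactly: eps_theta = eps
eps_theta = eps

# convert to the other two parametrizations
D_theta = x - sigma * eps_theta           # x0-prediction
s_theta = -eps_theta / sigma              # score-prediction

# all three agree on the recovered x0 and score
assert torch.allclose(D_theta, x0, atol=1e-6)
assert torch.allclose(x + sigma**2 * s_theta, x0, atol=1e-6)
assert torch.allclose((D_theta - x) / sigma**2, s_theta, atol=1e-6)
```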

9.2 why \(\epsilon\)-prediction has good training dynamics

the loss in \(\epsilon\)-form: \(\mathcal{L} = \mathbb{E}\|\epsilon_\theta(x_0 + \sigma\epsilon, \sigma) - \epsilon\|^2\)

the loss in \(x_0\)-form: \(\mathcal{L} = \sigma^{-2}\mathbb{E}\|D_\theta(x_\sigma, \sigma) - x_0\|^2\)

at large \(\sigma\): the \(\sigma^{-2}\) factor in \(x_0\)-form makes the loss tiny — the network doesn’t learn much from high-noise examples. the \(\epsilon\)-form weights all noise levels equally. in practice, practitioners reweight by \(\sigma^2\) to recover the \(x_0\) loss behavior, or use learned noise-level weighting (EDM’s preconditioning).

9.3 the sigma embedding — why not just use \(\sigma\) directly?

\(\sigma\) spans several orders of magnitude (e.g., 0.002 to 80 in EDM). raw \(\sigma\) as input creates optimization problems: gradients explode at large \(\sigma\).

the embedding \([\sin(\frac{1}{2}\log\sigma), \cos(\frac{1}{2}\log\sigma)]\):

  • maps \(\sigma \in (0, \infty)\) to a unit circle in \(\mathbb{R}^2\)
  • log-compresses the range — exponentially different \(\sigma\) values get reasonably separated
  • is continuous and periodic in \(\log\sigma\), matching the log-linear structure of the schedule

sinusoidal positional embeddings (used in Transformers) use the same principle: map position index to \([\sin, \cos]\) pairs at multiple frequencies. the two-dimensional version here is the simplest case.

INTERACTIVE — σ MAPPED ONTO THE UNIT CIRCLE + COMPONENT TIME SERIES


10. connections to other frameworks

10.1 energy-based models

an energy-based model defines \(p(x) \propto \exp(-E_\theta(x))\) for some learned energy. the score is \(-\nabla E_\theta(x)\). training EBMs requires contrastive divergence or MCMC — expensive.

diffusion = EBM where \(E_\sigma(x) = \text{dist}^2_\mathcal{K}(x, \sigma) / 2\sigma^2\) (from the Gibbs interpretation). but you never need MCMC — the score is learned directly from denoising.

10.2 variational autoencoders

VAE encodes \(x_0 \to z\) and decodes \(z \to x_0\) via a bottleneck. ELBO = reconstruction - KL divergence.

diffusion = hierarchical VAE with \(T\) latent levels \(x_1, x_2, \ldots, x_T\). the encoder \(q(x_t \mid x_0)\) is fixed (just add noise). the decoder \(p(x_{t-1} \mid x_t)\) is learned. the ELBO derivation of the diffusion objective (Kingma et al. 2021, VDM) exactly recovers the score matching loss as a sum of terms at each noise level.

10.3 normalizing flows

flows learn an invertible map \(f: \mathbb{R}^n \to \mathbb{R}^n\) from Gaussian to data. exact log-likelihood via change of variables. but \(f\) must be invertible — constrains architecture.

flow matching (Lipman et al. 2022) trains a vector field to move \(\mathcal{N}(0,I)\) to \(p_{\text{data}}\) via ODE. the probability flow ODE of diffusion is a flow matching ODE. these frameworks converge.


11. code — full implementation

training loop

import torch, torch.nn as nn
import numpy as np, math
from itertools import pairwise

class ScheduleLogLinear:
    def __init__(self, N, sigma_min=0.005, sigma_max=10.):
        self.sigmas = torch.logspace(math.log10(sigma_min), math.log10(sigma_max), N)
    def __len__(self): return len(self.sigmas)
    def __getitem__(self, i): return self.sigmas[i]
    def sample_batch(self, x0):
        idx = torch.randint(len(self), (x0.shape[0],))
        return self.sigmas[idx].to(x0)
    def sample_sigmas(self, steps):
        idx = list((len(self) * (1 - np.arange(0, steps)/steps)).round()
                   .astype(np.int64) - 1)
        return self[idx + [0]]

def training_loop(loader, model, schedule, epochs=10000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x0 in loader:
            opt.zero_grad()
            sigma = schedule.sample_batch(x0)      # (B,) noise levels
            eps   = torch.randn_like(x0)            # (B, n) noise vectors
            eps_hat = model(x0 + sigma[:, None] * eps, sigma)  # broadcast (B,) sigmas over features
            loss = nn.MSELoss()(eps_hat, eps)
            loss.backward()
            opt.step()

denoiser MLP (2D toy)

def get_sigma_embeds(sigma):
    s = sigma.unsqueeze(1)
    return torch.cat([torch.sin(torch.log(s)/2),
                      torch.cos(torch.log(s)/2)], dim=1)

class TimeInputMLP(nn.Module):
    def __init__(self, dim=2, hidden=(16,128,128,128,128,16)):
        super().__init__()
        layers = []
        dims = (dim + 2,) + hidden
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.GELU()]
        layers.append(nn.Linear(hidden[-1], dim))
        self.net = nn.Sequential(*layers)
        self.dim = dim

    def rand_input(self, B): return torch.randn(B, self.dim)

    def forward(self, x, sigma):
        return self.net(torch.cat([x, get_sigma_embeds(sigma)], dim=1))

unified sampler (generalizes DDIM, DDPM, GE)

@torch.no_grad()
def samples(model, sigmas, gam=1., mu=0., batchsize=1):
    """
    gam=1, mu=0   → DDIM (deterministic)
    gam=1, mu=0.5 → DDPM (stochastic, ancestral)
    gam=2, mu=0   → Gradient Estimation (deterministic, momentum)

    sigmas: descending list of noise levels, length N+1 for N steps
    """
    xt  = model.rand_input(batchsize) * sigmas[0]
    eps = None
    for i, (sig, sig_prev) in enumerate(pairwise(sigmas)):
        eps, eps_prev = model(xt, sig.to(xt).expand(len(xt))), eps  # batch the scalar sigma
        # momentum blend
        eps_av = eps * gam + eps_prev * (1-gam) if i > 0 else eps
        # intermediate sigma for noise injection
        sig_p = (sig_prev / sig**mu) ** (1/(1-mu))   # geom interp
        # noise coefficient
        eta = (sig_prev**2 - sig_p**2).sqrt()
        # update step
        xt = xt - (sig - sig_p) * eps_av + eta * model.rand_input(batchsize).to(xt)
        yield xt

ideal denoiser (analytical, for small datasets)

class IdealDenoiser:
    """exact optimal denoiser — tractable only for small K"""
    def __init__(self, dataset):
        self.data = torch.stack(list(dataset))   # |K| × n

    def __call__(self, x, sigma):
        # accept a float, a 0-dim tensor, or a (B,) batch of noise levels
        sigma = torch.as_tensor(sigma, dtype=x.dtype, device=x.device).reshape(-1, 1)
        x = x.flatten(1)                          # B × n
        d = self.data.flatten(1)                  # K × n
        # pairwise squared distances: B × K
        sq = (x**2).sum(1, keepdim=True) + (d**2).sum(1) - 2 * x @ d.T
        # softmax weights
        w = torch.softmax(-sq / (2 * sigma**2), dim=1)  # B × K
        # weighted mean prediction
        x0_hat = w @ d                             # B × n
        return (x - x0_hat) / sigma               # noise prediction form
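a self-contained smoke test of the whole pipeline, with the ideal denoiser and the DDIM step inlined so it runs without the classes above (dataset and schedule values are illustrative):

```python
import torch

torch.manual_seed(0)

# dataset K: four corners of a square
data = torch.tensor([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])

def ideal_eps(x, sigma):
    # exact noise prediction: (x - softmax-weighted mean of data) / sigma
    sq = torch.cdist(x, data) ** 2                    # (B, |K|) squared distances
    w = torch.softmax(-sq / (2 * sigma**2), dim=1)
    return (x - w @ data) / sigma

sigmas = torch.logspace(1, -2, 21)                    # 10 -> 0.01, log-linear, 20 steps
xt = torch.randn(64, 2) * sigmas[0]                   # start from pure noise

for sig, sig_prev in zip(sigmas[:-1], sigmas[1:]):
    xt = xt - (sig - sig_prev) * ideal_eps(xt, sig)   # DDIM update (gam=1, mu=0)

dists = torch.cdist(xt, data).min(dim=1).values
print(f"median distance to K after sampling: {dists.median():.4f}")
```

most samples land essentially on one of the four corners; the handful that start near a symmetry axis take longer to commit, which is the mode-boundary behavior the stochastic samplers are designed to smooth over.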

12. quick reference: every equation with its meaning

  • \(x_\sigma = x_0 + \sigma\epsilon\): forward process, corrupt clean data with Gaussian noise
  • \(\mathcal{L}(\theta) = \mathbb{E}\|\epsilon_\theta(x_0+\sigma\epsilon, \sigma) - \epsilon\|^2\): training loss, predict the added noise (equivalent to score matching)
  • \(\hat{x}_0 = x - \sigma\epsilon_\theta(x, \sigma)\): predicted clean data from noisy input (Tweedie form)
  • \(\mathbb{E}[x_0 \mid x_\sigma] = x_\sigma + \sigma^2\nabla\log p_\sigma(x_\sigma)\): Tweedie’s formula, score = direction toward the posterior mean
  • \(s_\sigma(x) = -\epsilon_\theta(x,\sigma)/\sigma \approx \nabla\log p_\sigma(x)\): score ≈ negative rescaled noise prediction
  • \(\frac{1}{2}\nabla_x\text{dist}^2_\mathcal{K}(x,\sigma) = \sigma\,\epsilon^*(x,\sigma)\): ideal denoiser = gradient of smoothed distance (geometric interpretation)
  • \(x_{t-1} = x_t - (\sigma_t - \sigma_{t-1})\epsilon_\theta(x_t,\sigma_t)\): DDIM, gradient descent on \(\frac{1}{2}\text{dist}^2_\mathcal{K}\)
  • \(\bar\epsilon_t = \gamma\epsilon_t + (1-\gamma)\epsilon_{t+1}\): gradient estimation, momentum / Heun’s method
  • \(x_{t-1} = x_t - (\sigma_t-\sigma_{t'})\bar\epsilon_t + \sqrt{\sigma_{t-1}^2-\sigma_{t'}^2}\,w_t\): unified sampler with stochastic noise injection
  • \(\sigma_{t-1} = \sigma_t^\mu\sigma_{t'}^{1-\mu}\): noise interpolation parameter, \(\mu=0.5\) gives DDPM
