Preface #
Diffusion models (DMs) have recently gained widespread attention due to improved sampling quality and more stable training protocols. Notable examples include DALL·E 2, which generates high-quality images from text prompts, and Sora, which focuses on video generation. In the mechanics community, Nikolaos N. Vlassis and WaiChing Sun, as well as Jan-Hendrik Bastek and Dennis M. Kochmann, have successfully applied these (video) denoising diffusion models to the inverse design of microstructures/metamaterials with nonlinear properties. Inspired by their work, I aim to explore these areas further, starting with DMs.
This is a collection of notes documenting my learning process on DMs. In the first few posts, I mainly focus on understanding key terminologies and the underlying mathematical and statistical principles, and I'll continue to update with code implementations when I have time. The main references I'm using are the detailed and accessible papers Understanding Diffusion Models: A Unified Perspective and Tutorial on Diffusion Models for Imaging and Vision.
Generative Models #
The goal of a generative model is to learn to model the true data distribution $p(\pmb{x})$ of given observed samples $\pmb{x}$ from a distribution of interest.
Once learned, we can (1) generate new samples from our approximate model at will. Furthermore, under some formulations, we are able to use the learned model to (2) evaluate the likelihood of observed or sampled data as well.
Classification #
- Implicit generative models: Generative Adversarial Networks (GANs) model the sampling procedure of a complex distribution, which is learned in an adversarial manner (adversarial training can be unstable).
- Likelihood-based models: these seek to learn a model that assigns a high likelihood to the observed data samples, directly learning the probability density/mass function (PDF/PMF) of the distribution via (approximate) maximum likelihood. This includes autoregressive models, normalizing flows, Variational Autoencoders (VAEs), energy-based modeling, and score-based generative models (score functions are gradients of log PDFs; a concrete example follows right after this list).
- DMs have both likelihood-based and score-based interpretations.
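As a quick illustration of the score-function idea mentioned above (a worked example of my own, not taken from the referenced papers), the score of an isotropic Gaussian density has a simple closed form obtained by differentiating its log density:
$$\nabla_{\pmb{x}}\log\mathcal{N}(\pmb{x};\pmb{\mu},\sigma^2\mathbf{I})=\nabla_{\pmb{x}}\left(-\frac{\|\pmb{x}-\pmb{\mu}\|^2}{2\sigma^2}+\text{const}\right)=-\frac{\pmb{x}-\pmb{\mu}}{\sigma^2},$$
which points from $\pmb{x}$ back toward the mean, i.e., toward regions of higher probability.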
Given a set of independent, identically distributed data points $\pmb{\mathrm{X}}=(x_1,\cdots,x_n)$, where $x_i\sim p(\pmb{x}|\pmb{\theta})$ according to some probability distribution parameterized by $\pmb{\theta}$, and where $\pmb{\theta}$ itself is a random variable described by a distribution, i.e. $\pmb{\theta}\sim p(\pmb{\theta}|\pmb{\alpha})$, the marginal likelihood in general asks what the probability $p(\pmb{\mathrm{X}}|\pmb{\alpha})$ is, where $\pmb{\theta}$ has been marginalized out (integrated out): $p(\pmb{\mathrm{X}}|\pmb{\alpha})=\int_{\pmb{\theta}} p(\pmb{\mathrm{X}}|\pmb{\theta})\,p(\pmb{\theta}|\pmb{\alpha})\,\mathrm{d}\pmb{\theta}$. The above definition is phrased in the context of Bayesian statistics, in which case $p(\pmb{\theta}|\pmb{\alpha})$ is called the prior density and $p(\pmb{\mathrm{X}}|\pmb{\theta})$ the likelihood.
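As a toy illustration of marginalizing out $\pmb{\theta}$ (a minimal sketch of my own, assuming a conjugate Beta-Bernoulli model so that the answer is also available in closed form), the integral above can be approximated by Monte Carlo: draw $\pmb{\theta}$ from the prior and average the likelihood.

```python
# Monte Carlo estimate of the marginal likelihood p(X | alpha) for a
# Beta-Bernoulli model, compared against its closed-form value.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)

# Hyperparameters alpha = (a, b) of the Beta prior p(theta | alpha)
a, b = 2.0, 2.0

# Observed coin flips X, each x_i ~ Bernoulli(theta)
X = rng.binomial(1, 0.7, size=20)
k, n = X.sum(), X.size

# Monte Carlo: p(X | alpha) ~= (1/S) * sum_s p(X | theta_s), theta_s ~ p(theta | alpha)
S = 100_000
theta = rng.beta(a, b, size=S)
mc_estimate = np.mean(theta**k * (1.0 - theta)**(n - k))

# Closed form for this conjugate model: B(a + k, b + n - k) / B(a, b)
exact = np.exp(betaln(a + k, b + n - k) - betaln(a, b))

print(f"Monte Carlo: {mc_estimate:.6e}, exact: {exact:.6e}")
```

With enough samples the two numbers agree to several digits; for non-conjugate models no closed form exists, which is where variational approaches such as the ELBO below become useful.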
Caveats #
In generative modeling, we generally seek to learn lower-dimensional latent representations rather than higher-dimensional ones. This is because trying to learn a representation of higher dimension than the observation is a fruitless endeavor without strong priors. On the other hand, learning lower-dimensional latents can also be seen as a form of compression, and can potentially uncover semantically meaningful structure describing observations. A representative interpretation can be seen in the work of Wang et al., which uses a VAE to map complex microstructures into a low-dimensional, continuous, and organized latent space.
Evidence Lower BOund (ELBO) #
We can think of the data we observe as represented or generated by an associated unseen latent random variable $\pmb{z}$.
Mathematically, the latent variables and the data we observe can be modeled by a joint distribution $p(\pmb{x},\pmb{z})$. The “likelihood-based” branch of generative modeling aims to learn a model to maximize the likelihood $p(\pmb{x})$ of all observed $\pmb{x}$. In order to recover the likelihood of purely the observed data $p(\pmb{x})$ from the joint distribution $p(\pmb{x},\pmb{z})$, we can
- marginalize out the latent variable $\pmb{z}$:
$$p(\pmb{x})=\int p(\pmb{x}, \pmb{z})\,\mathrm{d}\pmb{z}, \tag{1} \label{eq1}$$
- or apply the chain rule of probability:
$$p(\pmb{x})=\frac{p(\pmb{x}, \pmb{z})}{p(\pmb{z}|\pmb{x})}, \tag{2} \label{eq2}$$
where $p(\pmb{z}|\pmb{x})$ is called the (true) posterior.
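For a case where Equation $\eqref{eq1}$ can be carried out in closed form (a toy example of my own, not from the references), take a standard normal latent and a Gaussian observation model centered at the latent:
$$p(\pmb{z})=\mathcal{N}(\pmb{z};\pmb{0},\mathbf{I}),\quad p(\pmb{x}|\pmb{z})=\mathcal{N}(\pmb{x};\pmb{z},\sigma^2\mathbf{I})\quad\Longrightarrow\quad p(\pmb{x})=\int p(\pmb{x}|\pmb{z})\,p(\pmb{z})\,\mathrm{d}\pmb{z}=\mathcal{N}\left(\pmb{x};\pmb{0},(1+\sigma^2)\mathbf{I}\right).$$
As soon as $p(\pmb{x}|\pmb{z})$ is parameterized by something more expressive (e.g., a neural network), this integral loses its closed form, which is precisely the difficulty discussed next.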
Note that directly computing and maximizing the likelihood $p(\pmb{x})$ is difficult: first, because of the complexity of integrating out all latent variables $\pmb{z}$ in Equation $\eqref{eq1}$; and second, because evaluating Equation $\eqref{eq2}$ requires access to the ground-truth latent encoder $p(\pmb{z}|\pmb{x})$. However, these two equations can be used to derive the ELBO. Here is the detailed derivation of its initial expression:
\begin{align} \log p(\pmb{x}) &= \log\int p(\pmb{x}, \pmb{z})\,\mathrm{d}\pmb{z} \tag{3}\\ &= \log\int\frac{p(\pmb{x}, \pmb{z})q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\,\mathrm{d}\pmb{z} \tag{4}\\ &= \log\mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\frac{p(\pmb{x}, \pmb{z})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\right] \tag{5} \label{eq5}\\ &\geq\mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{p(\pmb{x}, \pmb{z})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\right], \tag{6} \label{eq6} \end{align}
and the Jessen’s Inequality
is applied to yield Equation $\eqref{eq6}$
, which is the ELBO we want.
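To get a feel for the Jensen step, here is a quick numerical check (a toy sketch of my own): for any positive random variable $w$, such as the ratio inside Equation $\eqref{eq5}$, concavity of $\log$ gives $\log\mathbb{E}[w]\geq\mathbb{E}[\log w]$.

```python
# Numerical check of Jensen's inequality for the concave log function:
# log E[w] >= E[log w] for a positive random variable w.
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # positive samples

lhs = np.log(np.mean(w))   # log of the expectation
rhs = np.mean(np.log(w))   # expectation of the log

print(f"log E[w] = {lhs:.4f} >= E[log w] = {rhs:.4f}")
# For LogNormal(0, 1) the gap approaches 0.5: log E[w] = 0.5 while E[log w] = 0.
```

In the ELBO setting, the size of this gap is exactly the KL term that appears in the alternative derivation below.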
Here are some related comments regarding the above derivation:
- The evidence is quantified in this case as the log likelihood of the observed data.
- A flexible approximate variational distribution $q_{\pmb{\phi}}(\pmb{z}|\pmb{x})$ can be thought of as a parametrizable model (with parameters $\pmb{\phi}$) that is learned to estimate the true distribution over latent variables for given observations $\pmb{x}$, i.e., $q_{\pmb{\phi}}(\pmb{z}|\pmb{x})\to p(\pmb{z}|\pmb{x})$ by tuning $\pmb{\phi}$ (a common concrete parametrization is sketched right after this list).
- This derivation, however, reveals little about why the ELBO is actually a lower bound of the evidence, or why maximizing it is the objective we want.
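As referenced in the first bullet above, a common concrete choice for $q_{\pmb{\phi}}(\pmb{z}|\pmb{x})$ (used, for example, by the encoder of a VAE) is a diagonal Gaussian whose mean and variance are produced by a neural network with parameters $\pmb{\phi}$:
$$q_{\pmb{\phi}}(\pmb{z}|\pmb{x})=\mathcal{N}\left(\pmb{z};\,\pmb{\mu}_{\pmb{\phi}}(\pmb{x}),\,\mathrm{diag}\left(\pmb{\sigma}^2_{\pmb{\phi}}(\pmb{x})\right)\right),$$
so that tuning $\pmb{\phi}$ simply means training this network.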
Why ELBO? #
Check out another way to derive it:
\begin{align} \log p(\pmb{x}) &= \log p(\pmb{x})\int q_{\pmb{\phi}}(\pmb{z}|\pmb{x})\,\mathrm{d}\pmb{z} \tag{7}\\ &= \int q_{\pmb{\phi}}(\pmb{z}|\pmb{x})\left(\log p(\pmb{x})\right)\,\mathrm{d}\pmb{z} \tag{8}\\ &= \mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log p(\pmb{x})\right] \tag{9}\\ &= \mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{p(\pmb{x}, \pmb{z})}{p(\pmb{z}|\pmb{x})}\right] \tag{10}\\ &= \mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{p(\pmb{x}, \pmb{z})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\right]+\mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}{p(\pmb{z}|\pmb{x})}\right] \tag{11} \label{eq11}\\ &= \mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{p(\pmb{x}, \pmb{z})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\right]+D_{\mathrm{KL}}\left(q_{\pmb{\phi}}(\pmb{z}|\pmb{x})\|p(\pmb{z}|\pmb{x})\right) \tag{12} \label{eq12}\\ &\geq\mathbb{E}_{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\left[\log\frac{p(\pmb{x}, \pmb{z})}{q_{\pmb{\phi}}(\pmb{z}|\pmb{x})}\right], \tag{13} \end{align}
The definition of Kullback-Leibler Divergence is used to obtain Equation $\eqref{eq12}$ from Equation $\eqref{eq11}$, and this step is the key to understanding why optimizing the ELBO is an appropriate objective. (Specifically, the KL Divergence measures the extra surprise caused by the model distribution $Q$ differing from the true distribution $P$, beyond the surprise inherent in $P$ itself, i.e., $\text{CrossEntropy}(P,Q)-\text{Entropy}(P)$; a small numerical check of the resulting identity appears after the summary below.) Now it's time to answer the two questions raised above:
- Why is the ELBO a lower bound? Because the difference between the evidence and the ELBO is a strictly non-negative KL term, the value of the ELBO never exceeds the evidence $\log p(\pmb{x})$.
- Why do we seek to maximize the ELBO? With the introduced latent variables $\pmb{z}$, our goal is to optimize the parameters of the variational posterior $q_{\pmb{\phi}}(\pmb{z}|\pmb{x})$ to exactly match the true posterior distribution $p(\pmb{z}|\pmb{x})$ (the unknown ground truth), which is achieved by minimizing their KL Divergence (ideally to zero). Note also that, since the ELBO plus the KL Divergence is a constant with respect to $\pmb{\phi}$, any maximization of the ELBO with respect to $\pmb{\phi}$ necessarily invokes an equal minimization of the KL Divergence.
\begin{align} \text{maximize the ELBO}\quad\Longrightarrow\quad q_{\pmb{\phi}}(\pmb{z}|\pmb{x})&\to p(\pmb{z}|\pmb{x})\\ \text{the ELBO}&\to \log p(\pmb{x}) \end{align}
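Here is the small numerical check promised earlier (a toy sketch of my own, using a one-dimensional version of the Gaussian model introduced above, for which the evidence, the true posterior, and the KL term are all available in closed form): for an arbitrary Gaussian variational posterior, the ELBO plus the KL Divergence recovers $\log p(\pmb{x})$.

```python
# Verify numerically that log p(x) = ELBO + KL(q(z|x) || p(z|x)) for
#   p(z) = N(0, 1),  p(x|z) = N(z, sigma^2)
# so that p(x) = N(0, 1 + sigma^2) and
#   p(z|x) = N(x / (1 + sigma^2), sigma^2 / (1 + sigma^2)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma2 = 0.5      # likelihood variance
x = 1.3           # a single observed data point

# An arbitrary (deliberately mismatched) Gaussian variational posterior q(z|x)
m, s2 = 0.2, 0.8

# Evidence and true posterior in closed form for this conjugate model
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma2))
post_mean = x / (1.0 + sigma2)
post_var = sigma2 / (1.0 + sigma2)

# Monte Carlo estimate of the ELBO: E_q[ log p(x, z) - log q(z|x) ]
z = rng.normal(m, np.sqrt(s2), size=1_000_000)
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, np.sqrt(sigma2))
log_q = norm.logpdf(z, m, np.sqrt(s2))
elbo = np.mean(log_joint - log_q)

# Closed-form KL divergence between two univariate Gaussians, KL(q || p(z|x))
kl = 0.5 * (np.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

print(f"log p(x) = {log_px:.4f},  ELBO + KL = {elbo + kl:.4f}")
```

The two printed numbers match up to Monte Carlo error, and the gap between the ELBO alone and $\log p(\pmb{x})$ shrinks as $(m, s^2)$ approaches the true posterior parameters.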
In general, maximizing the ELBO becomes a proxy objective with which to optimize a latent variable model, and once trained, the ELBO can also be used to estimate the likelihood of observed or generated data.
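For instance, in a VAE with a standard normal prior and a diagonal-Gaussian encoder, the training loss is simply the negative ELBO (a minimal sketch assuming PyTorch and a Bernoulli decoder for binary data; `recon_x`, `mu`, and `logvar` are hypothetical encoder/decoder outputs):

```python
import torch

def negative_elbo(x, recon_x, mu, logvar):
    """-ELBO = reconstruction term + KL(q_phi(z|x) || N(0, I)), summed over a batch.

    x and recon_x are binary data and its Bernoulli reconstruction probabilities;
    mu and logvar parameterize the diagonal-Gaussian q_phi(z|x).
    """
    # Reconstruction term -E_q[log p(x|z)] for the given decoder output (Bernoulli likelihood)
    recon = torch.nn.functional.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form KL between N(mu, diag(exp(logvar))) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Minimizing this quantity with respect to the encoder and decoder parameters corresponds to maximizing (a Monte Carlo estimate of) the ELBO described above.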
#diffusion models #machine learning
Last modified on 2025-03-16