Preface #
This blog will mainly focus on mathematical derivations of the (discrete-time) Langevin equation (Equation (1) in this blog) and its role in machine learning, without delving too much into the field of statistical mechanics. For the physics specifics, one can refer to the classic book Nonequilibrium Statistical Mechanics by Robert Zwanzig.
Prerequisites #
Newton’s Second Law of Motion #
The change of motion (momentum) of an object is proportional to the force impressed; and is made in the direction of the straight line in which the force is impressed.
Specifically, the time derivative of the momentum is the force:
$$\pmb{F}=\frac{\mathrm{d}\pmb{p}}{\mathrm{d}t}=m\frac{\mathrm{d}\pmb{v}}{\mathrm{d}t}=m\pmb{a}, \tag{1} \label{eq1}$$
the second identity holds when the mass $m$ is a time-independent variable. $\pmb{v}$ and $\pmb{a}$ are the corresponding velocity and acceleration vectors, respectively.
Brownian Motion and Langevin Equation #
A Langevin equation is a kind of stochastic differential equation describing how a system evolves when subjected to a combination of deterministic and fluctuating ("random") forces. The original Langevin equation was proposed to describe Brownian motion, the apparently random movement of a small particle suspended in a fluid due to collisions with the fluid's molecules.
Basic physical assumptions of Brownian motion: A particle undergoing Brownian motion is generally subject to two primary forces:
- Drag Force. A frictional force that opposes the particle's motion, proportional to the velocity $\pmb{v}(t)$ of the particle.
- Random Force. A stochastic force due to collisions with fluid molecules, which varies with time and is often modeled as a random noise term.
Incorporating these two types of forces into Newton's second law gives
$$m\frac{\mathrm{d}\pmb{v}}{\mathrm{d}t}=-\lambda\pmb{v}+\pmb{\eta}(t), \tag{2} \label{eq2}$$
where $\lambda=6\pi\mu R$ is the friction (damping) coefficient, whose specific value is given by Stokes' law ($\mu$ denotes the dynamic viscosity and $R$ is the particle radius). Here are some remarks regarding the characteristics of the random force $\pmb{\eta}(t)$:
It represents the unpredictable kicks that a Brownian particle experiences due to collisions with surrounding fluid molecules, and we assume it has the following two key properties.
- Zero Mean: Since such kicks from molecular collisions occur randomly in all possible directions, their averaged effects over time cancel out, which indicates that the random force is of zero mean:
$$\left\langle\pmb{\eta}(t)\right\rangle=\pmb{0}.$$
- Uncorrelated (or "Delta-correlated") in Time: The force at a time $t$ is assumed to be independent of the force at any other time instant. Specifically, the autocorrelation is given by $$\left\langle\pmb{\eta}(t),\pmb{\eta}(t^\prime)\right\rangle=2D\delta(t-t^\prime),$$ where $D$ is a constant related to the strength of the random force and, ultimately, to the temperature of the system (fluctuation-dissipation theorem). $\delta(t-t^\prime)$ is the Dirac delta function, which is zero for any $t\neq t^\prime$ and essentially "spikes" at $t=t^\prime$, showing that forces at different time instants are uncorrelated.
With all these properties satisfied, the random force can be modeled as a Gaussian distribution with zero mean and variance $\sigma^2$:
$$\pmb{\eta}(t)\sim\mathcal{N}(\pmb{0}, \sigma^2\pmb{\mathrm{I}}),$$
this is an approximation to the actual random force, which always has a nonzero correlation time corresponding to the collision time of the molecules. However, the Langevin equation describes the motion of a "macroscopic" particle over much longer time scales. In this regime, the $\delta$-correlation of the random force and the Langevin equation itself become nearly exact.
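To make Equation $\eqref{eq2}$ concrete, here is a minimal NumPy sketch (an added illustration, not part of the original derivation) that integrates it for a single 2-D particle. The mass, friction coefficient, noise strength, and time step are arbitrary illustrative values, and the random force is injected as $\sigma\sqrt{\Delta t}\,\pmb{\epsilon}$ per step, the usual Euler-Maruyama scaling for white noise.

```python
import numpy as np

# Illustrative (non-physical) parameters.
m, lam, sigma = 1.0, 2.0, 0.5      # mass, friction coefficient, noise strength
dt, n_steps = 1e-3, 50_000         # time step and number of steps

rng = np.random.default_rng(0)
v = np.zeros((n_steps, 2))         # velocity trajectory of one 2-D particle
x = np.zeros((n_steps, 2))         # position trajectory
v[0] = [1.0, 0.0]                  # initial kick

for t in range(n_steps - 1):
    eps = rng.standard_normal(2)
    # Euler-Maruyama step for m dv/dt = -lam * v + eta(t):
    # the white-noise force contributes sigma * sqrt(dt) * eps per step.
    v[t + 1] = v[t] + (dt / m) * (-lam * v[t]) + (sigma / m) * np.sqrt(dt) * eps
    x[t + 1] = x[t] + dt * v[t + 1]

# The drag damps the initial velocity toward zero on a time scale m / lam,
# while the random kicks keep the particle jittering around.
print("mean speed over the second half:",
      np.linalg.norm(v[n_steps // 2:], axis=1).mean())
```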
The Energy Perspective #
This part aims to relate the external force to the system energy, and we start from the work-energy principle. It is generally stated as
The work done by all forces acting on a particle (the work of the resultant force) equals the change in the kinetic energy of the particle.
Thus we have
$$\left\langle\pmb{F},\pmb{s}\right\rangle=\Delta E_{\mathrm{k}}=\frac{1}{2}m\|\pmb{v}_2\|^2-\frac{1}{2}m\|\pmb{v}_1\|^2, \tag{3} \label{eq3}$$
here $\pmb{s}$ is the displacement and $\left\langle\cdot,\cdot\right\rangle$ denotes the inner product of vectors. $E_{\mathrm{k}}$ is the particle's kinetic energy and the symbol $\Delta$ means the change in kinetic energy. $\pmb{v}_1$ and $\pmb{v}_2$ are the velocities of the particle before and after the work is done. We further seek the infinitesimal representation of Equation $\eqref{eq3}$:
\begin{align} &\left\langle\pmb{F}(\pmb{x}),\mathrm{d}\pmb{x}\right\rangle=\mathrm{d}E_{\mathrm{k}}(\pmb{x});\\ &\pmb{F}(\pmb{x})=\nabla_{\pmb{x}}E_{\mathrm{k}}(\pmb{x}), \tag{4} \label{eq4} \end{align}
therefore, the force can be interpreted as the gradient of kinetic energy w.r.t. $\pmb{x}$.
We assume that the total energy of particles inside a system consists of the kinetic energy $E_{\mathrm{k}}(\pmb{x})$ and the potential energy $U(\pmb{x})$. Mathematically, we get $E=U(\pmb{x})+E_{\mathrm{k}}(\pmb{x})$, and Equation $\eqref{eq4}$ becomes
\begin{align} \pmb{F}(\pmb{x})&=\nabla_{\pmb{x}}E_{\mathrm{k}}(\pmb{x})\\ &=\nabla_{\pmb{x}}(E-U(\pmb{x}))\\ &=-\nabla_{\pmb{x}}U(\pmb{x}), \tag{5} \label{eq5} \end{align}
note that energy conservation is used in the above derivation. Finally, we arrive at the conclusion: force is the negative gradient of the particle's potential energy.
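As a quick concrete check (an added example, not part of the original argument), consider a harmonic potential:
$$U(\pmb{x})=\frac{1}{2}k\|\pmb{x}\|^2 \quad\Longrightarrow\quad \pmb{F}(\pmb{x})=-\nabla_{\pmb{x}}U(\pmb{x})=-k\pmb{x},$$
which is Hooke's law: the force always points back toward the minimum of the potential, with magnitude proportional to the displacement.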
Boltzmann Distribution #
The Boltzmann distribution (AKA Gibbs distribution) is a probability distribution that describes the likelihood of a thermal-equilibrium system being in a particular state based on that state's energy and the system's temperature. It's a fundamental concept in statistical mechanics, especially for understanding how particles in a system, e.g., molecules in a gas, distribute themselves among various energy levels.
WHAT IS THAT?
Mathematically, for a system with discrete energy levels, the probability $p_i$ of a particle being in state $i$ with energy $E_i$ is given by:
$$p_i=\frac{1}{Z}e^{-\frac{E_i}{k_\mathrm{B}T}}, \tag{6} \label{eq6}$$
where $E_i$ is the energy of state $i$; $k_\mathrm{B}$ is the Boltzmann constant (a physical constant linking energy to temperature); $T$ denotes the absolute temperature; $Z=\sum_j e^{-E_j/(k_\mathrm{B}T)}$ is the partition function, a normalization factor to ensure the probabilities sum to 1.
Equation $\eqref{eq6}$ shows that states with lower energy are exponentially more likely to occur than states with higher energy, especially at lower temperatures. As temperature increases, the difference in probabilities between high-energy and low-energy states diminishes, leading to a more uniform distribution of states.
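The following small sketch (an added example; the three energy levels and the values of $k_\mathrm{B}T$ are arbitrary choices) evaluates Equation $\eqref{eq6}$ numerically and shows this flattening effect as the temperature grows.

```python
import numpy as np

def boltzmann(energies, kT):
    """Boltzmann probabilities p_i = exp(-E_i / kT) / Z for discrete states."""
    # Subtracting the minimum energy improves numerical stability;
    # the shift cancels in the normalization by Z.
    w = np.exp(-(energies - energies.min()) / kT)
    return w / w.sum()

E = np.array([0.0, 1.0, 2.0])      # three arbitrary energy levels
for kT in (0.5, 1.0, 5.0):         # k_B * T in the same (arbitrary) energy units
    print(f"kT = {kT}: p = {np.round(boltzmann(E, kT), 3)}")
# Low kT concentrates probability on the lowest-energy state;
# large kT pushes the probabilities toward the uniform distribution.
```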
RELATIONSHIP TO ENERGY-BASED MODELS IN ML
In ML, Energy-Based Models (EBMs) borrow the concept of energy from physics to model complex probability distributions of structured data like images or natural language. The relationship between Boltzmann distributions and EBMs can be understood through the following points:
- Probability Modeling. In EBMs, an "energy function" $E(\pmb{x})$ assigns a scalar value to each possible state $\pmb{x}$, which measures the "likelihood" or "desirability" of that state: the model seeks to assign lower energy to more likely (or desirable) states. By applying a Boltzmann-like formula, EBMs can model the probability of each state $\pmb{x}$ (or data point) as: $$p(\pmb{x})=\frac{1}{Z}e^{-U(\pmb{x})}, \tag{7} \label{eq7}$$ here the potential energy $U(\pmb{x})$ is adopted as the energy function in the context of the Langevin equation; $Z=\int e^{-U(\pmb{x})}\,\mathrm{d}\pmb{x}$ similarly serves as a normalization factor.
- Learning and Inference. In an EBM, learning involves adjusting the parameters of the energy function so that the model assigns lower-energy values (higher probabilities) to desirable states (e.g., correct labels, plausible configurations) and higher-energy values (lower probabilities) to undesirable states. This is analogous to how the Boltzmann distribution encourages lower-energy states, but in ML, this process is tuned via optimization techniques rather than by adjusting physical parameters like temperature.
- Boltzmann Machines. A typical example of an EBM in ML: a type of stochastic neural network that directly uses the Boltzmann distribution to learn data patterns. It involves a set of interconnected binary nodes and learns to represent data by adjusting connection weights, effectively shaping the energy landscape so that desired patterns in the data correspond to low-energy states (a toy numerical sketch follows this list).
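To connect Equation $\eqref{eq7}$ with the Boltzmann-machine picture, here is a toy sketch (an added illustration with hand-picked couplings and biases; it only evaluates the energy landscape and does no learning): it enumerates every state of a tiny network of three $\pm 1$ units and shows that the lowest-energy configurations receive the highest probability.

```python
import itertools
import numpy as np

# A toy "Boltzmann-machine-like" energy over three binary (+/-1) units.
# W holds symmetric pairwise couplings (zero diagonal), b holds biases.
W = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  0.8],
              [-0.5, 0.8,  0.0]])
b = np.array([0.1, -0.2, 0.0])

def energy(s):
    """E(s) = -1/2 s^T W s - b^T s; 'compatible' configurations get lower energy."""
    return -0.5 * s @ W @ s - b @ s

states = [np.array(s, dtype=float) for s in itertools.product([-1, 1], repeat=3)]
E = np.array([energy(s) for s in states])
p = np.exp(-E) / np.exp(-E).sum()          # Boltzmann-like probabilities, as in Eq. (7)

for s, e, pi in sorted(zip(states, E, p), key=lambda t: t[1]):
    print(s, f"E = {e:+.2f}", f"p = {pi:.3f}")
# The lowest-energy configurations come out most probable; learning in a real
# Boltzmann machine adjusts W and b so that training patterns land in such states.
```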
Derivation of Langevin Dynamics in ML #
With Equations $\eqref{eq1}$ and $\eqref{eq5}$, Equation $\eqref{eq2}$ describing Brownian motion can be written as
$$-\nabla_{\pmb{x}}U(\pmb{x})=-\lambda\pmb{v}(t)+\pmb{\eta}(t),$$
moreover, according to the previous discussions, the random force term here can be modeled as a Gaussian, which can be further related to a standard Gaussian $\pmb{\epsilon}$ by $\pmb{\eta}=\sigma\pmb{\epsilon}$. Solving for $\pmb{v}(t)=\mathrm{d}\pmb{x}(t)/\mathrm{d}t$, we obtain the following equation
$$\frac{\mathrm{d}\pmb{x}(t)}{\mathrm{d}t}=\frac{1}{\lambda}\nabla_{\pmb{x}}U(\pmb{x})+\frac{\sigma}{\lambda}\pmb{\epsilon},\quad \pmb{\epsilon}\sim\mathcal{N}(\pmb{0},\pmb{\mathrm{I}}).$$
From differentiation to discretization:
$$\pmb{x}_{t+1}-\pmb{x}_t=\frac{\Delta t}{\lambda}\nabla_{\pmb{x}}U(\pmb{x}_t)+\frac{\sigma\Delta t}{\lambda}\pmb{\epsilon},$$
where $\Delta t$ is the time step for sampling and $\lambda$ is, as before, the friction coefficient; these are all constants. For simplicity, another constant $\tau=\frac{\Delta t}{\lambda}$ is introduced here. Finally, we get
$$\pmb{x}_{t+1}=\pmb{x}_t+\tau\nabla_{\pmb{x}}U(\pmb{x}_t)+\tau\sigma\pmb{\epsilon}.$$
Note that Equation $\eqref{eq2}$ actually describes a motion that tends to drive particles to spread out in order to maximize the entropy (disorder) of a closed system. This process increases the potential energy and decreases the probability of the particle's occurrence. In contrast, for ML tasks, we aim for the generated sample to be the most probable one. Therefore, the sign of the gradient term $\nabla_{\pmb{x}}U(\pmb{x}_t)$ should be negative:
$$\pmb{x}_{t+1}=\pmb{x}_t-\tau\nabla_{\pmb{x}}U(\pmb{x}_t)+\tau\sigma\pmb{\epsilon}. \tag{8} \label{eq8}$$
Taking the gradient of the log of Equation $\eqref{eq7}$, we have
$$\nabla_{\pmb{x}}\log p(\pmb{x}_t)=\nabla_{\pmb{x}}\left\{\log\frac{1}{Z}-U(\pmb{x}_t)\right\}=-\nabla_{\pmb{x}}U(\pmb{x}_t),$$
so Equation $\eqref{eq8}$ finally takes the form
$$\pmb{x}_{t+1}=\pmb{x}_t+\tau\nabla_{\pmb{x}}\log p(\pmb{x}_t)+\tau\sigma\pmb{\epsilon}, \tag{9} \label{eq9}$$
this is exactly the (discrete-time) Langevin equation used for sampling in ML.
Unlike in physics, the constants involved here can either be omitted or specified arbitrarily for sampling algorithms. If we let $\sigma=\sqrt{2/\tau}$, Equation $\eqref{eq9}$ becomes "Equation (1)" in this blog:
$$\pmb{x}_{t+1}=\pmb{x}_t+\tau\nabla_{\pmb{x}}\log p(\pmb{x}_t)+\sqrt{2\tau}\pmb{\epsilon},\quad \pmb{\epsilon}\sim\mathcal{N}\left(\pmb{\epsilon}; \pmb{0}, \pmb{\mathrm{I}}\right).$$
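As a closing illustration (an added sketch, not part of the original post), the snippet below runs this update rule on a toy target, a two-mode 2-D Gaussian mixture whose score $\nabla_{\pmb{x}}\log p(\pmb{x})$ is coded by hand; the step size $\tau$, iteration count, and initialization are arbitrary choices. Starting from broadly scattered points, the iterates drift toward high-probability regions while the noise keeps them exploring, so the final samples cluster around the two modes.

```python
import numpy as np

rng = np.random.default_rng(0)
MU = np.array([[-2.0, 0.0], [2.0, 0.0]])    # means of the two mixture components

def score(X):
    """grad_x log p(x), vectorized over the rows of X, for an equal-weight
    mixture of two unit-variance Gaussians centered at the rows of MU."""
    diffs = MU[None, :, :] - X[:, None, :]            # (n, 2, 2): mu_k - x
    logw = -0.5 * np.sum(diffs ** 2, axis=2)          # log N(x; mu_k, I) up to constants
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # component responsibilities
    return np.einsum('nk,nkd->nd', w, diffs)          # sum_k w_k * (mu_k - x)

def langevin_sample(n_samples=500, n_steps=1000, tau=0.01):
    """Iterate x <- x + tau * grad log p(x) + sqrt(2 * tau) * eps, i.e. Eq. (9)."""
    X = 3.0 * rng.standard_normal((n_samples, 2))     # broad initialization
    for _ in range(n_steps):
        eps = rng.standard_normal(X.shape)
        X = X + tau * score(X) + np.sqrt(2 * tau) * eps
    return X

samples = langevin_sample()
# The samples should cluster around the modes at (-2, 0) and (2, 0),
# so the mean absolute first coordinate comes out close to 2.
print("mean |x1|:", np.abs(samples[:, 0]).mean())
```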