Fokker-Planck equation
We derive the Fokker-Planck equation

$$\part{p}{t} = -\nabla \cdot (p v) + \eta \Delta p$$

which describes the evolution of a density $p$ under the flow of a stochastic differential equation (SDE)

$$dX = v(X) \, dt + \sqrt{2\eta} \, dW$$

where $v$ is a vector field, $W$ is the Wiener process (Brownian motion), and $\eta > 0$ is a noise parameter. The SDE above is interpreted in the Itô sense, which is like the forward Euler method; see below.
Notice that when $\eta = 0$ we recover the deterministic differential equation $dX = v(X) dt$, or $\dot X_t = v(X_t)$, and the Fokker-Planck equation above reduces to the continuity equation that we saw last time.
But what is interesting in the stochastic case is when the vector field $v$ is the (negative) gradient of a potential or objective function $f$:

$$dX = -\nabla f(X) \, dt + \sqrt{2\eta} \, dW.$$

Then for any $\eta > 0$, the SDE above has an explicit stationary distribution, called the Gibbs distribution:

$$p_\eta(x) = \frac{1}{Z} e^{-f(x)/\eta}$$

where $Z = \int e^{-f(x)/\eta} dx$ is the normalizing constant. When $f$ is strongly convex, we can show exponentially fast convergence of any initial density to the stationary density $p_\eta$ as $t \to \infty$; this is the celebrated Bakry-Émery theory. Note also that as $\eta \to 0$, $p_\eta \to \delta_{x^\ast}$, a point mass at the minimizer $x^\ast = \arg\min_x f(x)$. Thus, convex optimization is the deterministic case of stochastic gradient flow.
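For a concrete example (a standard one, not from the references below): with the quadratic potential $f(x) = \frac{1}{2} \|x\|^2$, the Gibbs distribution is the Gaussian

$$p_\eta(x) = \frac{1}{(2\pi\eta)^{d/2}} \, e^{-\|x\|^2/(2\eta)} = \mathcal{N}(0, \eta \I),$$

which indeed collapses to the point mass $\delta_0$ at the minimizer $x^\ast = 0$ as $\eta \to 0$.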
Moreover, the Gibbs distribution above is an exponential family distribution, which, recall, is the maximum entropy distribution, and we can in fact show that stochastic gradient flow is the gradient flow in the space of probability measures! But before we go any further, let us first understand the basic properties.
As before, this post follows Mackey’s Time’s Arrow, particularly $\S11B$, and Evans’ notes on SDE, $\S3$-$4$.
Wiener process
Recall that the Wiener process (or Brownian motion) on $\X = \R^d$ is a continuous-time stochastic process $W_t \in \X$, $t \ge 0$, with independent Gaussian increments, which means:

$$W_0 = 0, \qquad W_t - W_s \sim \mathcal{N}(0, \, (t-s) \I) \quad \text{for all } 0 \le s < t,$$

with the increments over disjoint time intervals being independent, and $W$ has continuous sample paths almost surely.
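To make this concrete, here is a minimal NumPy sketch (my own, not from the references) that simulates a sample path directly from this definition, by accumulating independent Gaussian increments:

```python
import numpy as np

rng = np.random.default_rng(0)

def wiener_path(T=1.0, n=10_000, d=1):
    """Simulate a Wiener process on [0, T] by summing independent
    Gaussian increments W_{t_{k+1}} - W_{t_k} ~ N(0, (t_{k+1} - t_k) I)."""
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=(n, d))            # independent increments
    W = np.vstack([np.zeros((1, d)), np.cumsum(dW, axis=0)])  # W_0 = 0
    t = np.linspace(0.0, T, n + 1)
    return t, W

t, W = wiener_path()
print(W[-1])  # W_T is distributed as N(0, T)
```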
The Wiener process has many bizarre properties. For example, on any compact domain $[0,T]$, the sample path is almost surely uniformly $\gamma$-Hölder continuous for any $0 < \gamma < \frac{1}{2}$, but not for any $\gamma \ge \frac{1}{2}$; so in particular, $W_t$ is nowhere differentiable in $t$.
Furthermore, the Wiener process has infinite variation, which means almost surely:

$$\sup_P \sum_{k=0}^{n-1} |W_{t_{k+1}} - W_{t_k}| = \infty$$

where the supremum is over all possible partitions $P = \{0 = t_0 < t_1 < \dots < t_n = T\}$ of the interval $[0,T]$ into $n$ pieces, over all possible $n$ (or as $n \to \infty$). However, the Wiener process has a finite (linear) quadratic variation:

$$\sum_{k=0}^{n-1} (W_{t_{k+1}} - W_{t_k})^2 \to T \quad \text{as } |P| \to 0,$$

where the convergence is in $L^2$, i.e., if the quantity above is $Y_n$, then $\E[(Y_n-T)^2] \to 0$. Heuristically, if we write $t_{k+1} - t_k = dt$ and $W_{t_{k+1}} - W_{t_k} = dW$, then the above says that $\frac{T}{dt} \cdot (dW)^2 = T$, or:

$$(dW)^2 = dt.$$
That is, the differential $dW$ of the Wiener process operates on a scale that is the square root of the natural scale of time $dt$. Thus, a characteristic of stochastic dynamics is the presence of multiple timescales (as I first heard from Martin Wainwright).
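We can observe both properties numerically on a discretized path: as the partition is refined, the total variation $\sum_k |W_{t_{k+1}} - W_{t_k}|$ blows up (like $\sqrt{n}$), while the quadratic variation $\sum_k (W_{t_{k+1}} - W_{t_k})^2$ settles at $T$. A quick sketch (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1.0
for n in [100, 10_000, 1_000_000]:
    # increments of W on a uniform partition of [0, T] into n pieces
    dW = rng.normal(0.0, np.sqrt(T / n), size=n)
    print(f"n = {n:>9}: "
          f"total variation = {np.sum(np.abs(dW)):9.2f} (diverges), "
          f"quadratic variation = {np.sum(dW**2):.4f} (-> T = {T})")
```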
Itô integral
Given a stochastic process $G$ on $[0,T]$, we wish to define the stochastic integral $\int_0^T G dW$, which should be a random variable, where $W$ is the Wiener process on $\X = \R$.
Following Evans $\S4$, let us first consider the case $G = W$, so we ask, what is a reasonable definition of $\int_0^T W dW$? A naive answer based on ordinary calculus is $\frac{1}{2} W_T^2$ (if $W_0=0$), but this turns out to be incorrect because of the quadratic variation property of the Wiener process.
Concretely, let us consider what happens when we try to approximate the integral by a summation. Fix $0 \le \lambda \le 1$, and for every partition

$$P = \{0 = t_0 < t_1 < \dots < t_n = T\}$$

we define the Riemann sum approximation

$$R(P, \lambda) = \sum_{k=0}^{n-1} W_{\tau_k} (W_{t_{k+1}} - W_{t_k})$$

where we are evaluating the integrand $W$ at the intermediate point $\tau_k = \tau_k(\lambda)$ defined by:

$$\tau_k = (1-\lambda) \, t_k + \lambda \, t_{k+1}.$$
So for example, $\lambda = 0$ means $\tau_k = t_k$, which is like the forward Euler method (and is the preferred method of discretization since we don’t need to “know the future”), whereas $\lambda = 1$ means $\tau_k = t_{k+1}$, which is like the backward Euler method (which is supposedly more stable but difficult to implement).
Now let the size of the partition $|P| = \max_k |t_{k+1}-t_k| \to 0$, while keeping $\lambda$ fixed. In ordinary calculus, the choice of $\lambda$ does not matter and the sum converges to the integral. However, in stochastic calculus the choice of $\lambda$ does matter, essentially because of the quadratic variation property of $W$. Concretely, for any sequence of partitions $P_n$ with $|P_n| \to 0$:

$$R(P_n, \lambda) \to \frac{1}{2} W_T^2 + \Big(\lambda - \frac{1}{2}\Big) T$$

where the convergence is in $L^2$. So we see that there is a correction of order $T$ from the usual answer.
Now, the definition of Itô calculus corresponds to the choice of $\lambda = 0$ (like the forward Euler method, as noted above), so we have the formula:

$$\int_0^T W \, dW = \frac{1}{2} W_T^2 - \frac{T}{2}.$$
This seems to be the standard definition in the literature that I’ve seen, and it has many nice properties (for instance, we can simulate the dynamics over time without knowing the future, as remarked above). Another main definition is the Stratonovich calculus, which corresponds to $\lambda = \frac{1}{2}$, so it is more symmetric and more similar to ordinary calculus; however, we will not discuss it further.
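To see the $\lambda$-dependence numerically, we can compare the Riemann sums against the predicted limit $\frac{1}{2} W_T^2 + (\lambda - \frac{1}{2}) T$. One shortcut in the sketch below (my own): instead of sampling $W$ exactly at the intermediate time $\tau_k$, we interpolate the two endpoint values, which has the same $L^2$ limit (the interpolated sum equals the left-point sum plus $\lambda \sum_k (W_{t_{k+1}} - W_{t_k})^2 \to \lambda T$).

```python
import numpy as np

rng = np.random.default_rng(2)

T, n = 1.0, 1_000_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])  # W_{t_0}, ..., W_{t_n}

for lam, name in [(0.0, "Ito"), (0.5, "Stratonovich"), (1.0, "backward")]:
    W_tau = (1 - lam) * W[:-1] + lam * W[1:]  # interpolated value at tau_k
    riemann = np.sum(W_tau * dW)
    predicted = 0.5 * W[-1] ** 2 + (lam - 0.5) * T
    print(f"{name} (lambda = {lam}): sum = {riemann:.4f}, predicted = {predicted:.4f}")
```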
Then we can define the Itô integral of a general stochastic process $G$ the same way, by evaluating the sum approximation at the left time point:

$$\int_0^T G \, dW = \lim_{n \to \infty} \sum_{k=0}^{n-1} G^{(n)}_{t_k} (W_{t_{k+1}} - W_{t_k})$$

where $G^{(n)}$ is a step process approximating $G$ that is constant on each subinterval $[t_k, t_{k+1})$, and the convergence is in $L^2$ (under reasonable conditions, e.g., $G$ is nonanticipating and $\E[\int_0^T G^2 dt] < \infty$). We can check that the Itô integral is a linear operator in $G$, and it has the following nice properties:

$$\E\left[ \int_0^T G \, dW \right] = 0, \qquad \E\left[ \Big( \int_0^T G \, dW \Big)^2 \right] = \E\left[ \int_0^T G^2 \, dt \right],$$
and more generally, $\E[(\int_0^T G dW)(\int_0^T H dW)] = \E[\int_0^T GH dt]$.
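Both properties are easy to test by Monte Carlo; e.g., for the nonanticipating choice $G = W$, we have $\E[\int_0^T W^2 \, dt] = \int_0^T t \, dt = \frac{T^2}{2}$. A sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(3)

T, n, trials = 1.0, 500, 10_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(trials, n))
W = np.cumsum(dW, axis=1) - dW  # left endpoints W_{t_k}, starting from W_0 = 0

I = np.sum(W * dW, axis=1)  # samples of the Ito integral int_0^T W dW

print("E[int W dW]     =", I.mean())                            # ~ 0
print("E[(int W dW)^2] =", (I ** 2).mean())                     # ~ T^2/2 = 0.5
print("E[int W^2 dt]   =", ((W ** 2).sum(axis=1) * dt).mean())  # ~ T^2/2 = 0.5
```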
Itô's formula
Using the machinery of the Itô integral above, we can now define what a stochastic differential equation is. Let $X_t \in \R$, $0 \le t \le T$, be a stochastic process satisfying

$$X_t = X_s + \int_s^t F \, dr + \int_s^t G \, dW$$

for some $F \in \L^1$, which means $\E[\int_0^T |F| dt] < \infty$, and $G \in \L^2$, which means $\E[\int_0^T G^2 dt] < \infty$, and for all times $0 \le s \le t \le T$. Then we say that $X$ has the stochastic differential

$$dX = F \, dt + G \, dW$$

for $0 \le t \le T$. Note that strictly speaking, the differential symbols $dX$, $dt$, $dW$ have no real meaning, and the equation above is really an abbreviation for the integral definition, which uses the Itô calculus.
An important result is the following Itô’s formula, which is just the chain rule for stochastic calculus, except that we also pick up second-order terms because $dW^2 = dt$: for a smooth function $u \colon \R \to \R$,

$$dY = d(u(X)) = u'(X) \, dX + \frac{1}{2} u''(X) \, (dX)^2 = \Big( u'(X) F + \frac{1}{2} u''(X) G^2 \Big) dt + u'(X) G \, dW.$$
In particular, notice that $Y = u(X)$ also satisfies a stochastic differential equation.
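As a sanity check (a standard exercise, consistent with what we computed earlier): take $u(x) = x^2$ and $X = W$, so $F = 0$ and $G = 1$. Then Itô’s formula gives

$$d(W^2) = 2 W \, dW + dt, \qquad \text{i.e.,} \qquad W_T^2 = 2 \int_0^T W \, dW + T,$$

which rearranges exactly to the formula $\int_0^T W \, dW = \frac{1}{2} W_T^2 - \frac{T}{2}$ from the previous section; ordinary calculus would have missed the extra $T$ term.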
More generally, consider a stochastic differential equation $dX = F dt + G dW$ on $\X = \R^d$, where $F = (F^1,\dots,F^d) \in \R^d$ is a vector field and $G = (G^{ij})$ is a $d \times d$ matrix (it can also be a rectangular matrix), whose entries $F^i$, $G^{ij}$ are stochastic processes, and define the $d \times d$ covariance matrix:

$$H = G G^\top.$$

Then Itô’s formula for $Y = u(X) \in \R$ takes the following form:

$$dY = \Big( \langle \nabla u(X), F \rangle + \frac{1}{2} \langle \nabla^2 u(X), H \rangle_{\Fr} \Big) dt + \langle \nabla u(X), G \, dW \rangle$$
where $\langle A,B \rangle_{\Fr} = \Tr(AB^\top) = \sum_{i,j=1}^d A_{ij}B_{ij}$ is the Frobenius inner product between matrices. As before, we can derive this by formally expanding $du$ up to second-order terms and substituting $dW (dW)^\top = \I dt$, where $\I$ is the identity matrix.
Fokker-Planck equation
With Itô’s formula in hand, we are now ready to derive the Fokker-Planck equation. As before, we consider the stochastic differential equation on $\X = \R^d$:

$$dX = F(X) \, dt + G(X) \, dW$$

where $F = F(X) \in \R^d$ is a vector field and $G = G(X) \in \R^{d \times d}$ is a matrix of standard deviations, and we assume the covariance matrix $H = G G^\top$ is positive definite, uniformly in $x$:

$$\langle H(x) \, \xi, \xi \rangle \ge \rho \|\xi\|^2 \quad \text{for all } x, \xi \in \R^d$$

for some $\rho > 0$ (this is called the “uniform parabolicity condition” in Mackey p. 143). Under some additional assumptions, for instance that $F$ and $G$ have bounded Lipschitz smooth entries, the solution $X_t$ to the SDE above is guaranteed to exist, and it has a density $p(x,t)$ defined by

$$\Pr(X_t \in A) = \int_A p(x,t) \, dx$$

for any measurable set $A \subseteq \R^d$, where $dx$ is the Lebesgue measure.
Moreover, the density $p(x,t)$ satisfies an evolution equation that we can calculate as follows. Take an arbitrary smooth function $u \colon \X \to \R$ with compact support, and consider how $\E[u(X_t)]$ evolves. On the one hand, we have:

$$\frac{d}{dt} \E[u(X_t)] = \frac{d}{dt} \int_\X u(x) \, p(x,t) \, dx = \int_\X u(x) \, \part{p}{t}(x,t) \, dx.$$

On the other hand, by using Itô’s formula we have:

$$d \, \E[u(X_t)] = \E[\langle \nabla u(X_t), F \rangle] \, dt + \frac{1}{2} \E[\langle \nabla^2 u(X_t), H \rangle_{\Fr}] \, dt + \E[\langle \nabla u(X_t), G \, dW \rangle].$$

Notice that the last expectation above is equal to zero, since the Wiener process increment $dW$ is independent of $X_t$, so the expectation splits:

$$\E[\langle \nabla u(X_t), G \, dW \rangle] = \E[\langle G^\top \nabla u(X_t), \, dW \rangle] = \langle \E[G^\top \nabla u(X_t)], \, \E[dW] \rangle$$

and $\E[dW] = 0$. Therefore, we have the formula for the time derivative of $\E[u(X_t)]$:

$$\frac{d}{dt} \E[u(X_t)] = \E[\langle \nabla u(X_t), F \rangle] + \frac{1}{2} \E[\langle \nabla^2 u(X_t), H \rangle_{\Fr}].$$
Now to evaluate the expectations on the right-hand side above, we use the magical property of the divergence operator $\nabla \cdot$ as the adjoint of the negative gradient $-\nabla$:

$$\int_\X \langle \nabla u, v \rangle \, dx = -\int_\X u \, (\nabla \cdot v) \, dx,$$

which is really just integration by parts, for any compactly supported function $u$ and vector field $v$ on $\X$. Applying this to the first expectation above, we obtain:

$$\E[\langle \nabla u(X_t), F \rangle] = \int_\X \langle \nabla u, F \rangle \, p \, dx = -\int_\X u \, \nabla \cdot (p F) \, dx.$$

Similarly, by applying this twice to the second expectation above (which is less confusing if we write it in components), we get:

$$\frac{1}{2} \E[\langle \nabla^2 u(X_t), H \rangle_{\Fr}] = \frac{1}{2} \int_\X \sum_{i,j=1}^d \frac{\partial^2 u}{\partial x_i \partial x_j} H^{ij} \, p \, dx = \frac{1}{2} \int_\X u \sum_{i,j=1}^d \frac{\partial^2}{\partial x_i \partial x_j} (p H^{ij}) \, dx = \frac{1}{2} \int_\X u \, \nabla \cdot (\nabla \cdot (p H)) \, dx.$$

Therefore, by equating the two expressions for $\frac{d}{dt} \E[u(X_t)]$ above, we obtain:

$$\int_\X u \, \part{p}{t} \, dx = \int_\X u \, \Big( -\nabla \cdot (p F) + \frac{1}{2} \nabla \cdot (\nabla \cdot (p H)) \Big) \, dx.$$

Since $u$ is arbitrary, we conclude that the evolution equation for the density $p(x,t)$ is:

$$\part{p}{t} = -\nabla \cdot (p F) + \frac{1}{2} \nabla \cdot (\nabla \cdot (p H)),$$

which is the Fokker-Planck equation, also called the forward Kolmogorov equation. Notice that it can also be written as a continuity equation:

$$\part{p}{t} = -\nabla \cdot J$$
where $J = pF - \frac{1}{2} \nabla \cdot (pH)$ is the probability current.
For example, consider when $G = \sqrt{2\eta} \, \I$ for some constant $\eta > 0$, so the SDE is:

$$dX = F(X) \, dt + \sqrt{2\eta} \, dW$$

and the covariance matrix $H = 2\eta \I$ is diagonal. Then $\nabla \cdot (pH) = 2\eta \nabla p$, and so the Fokker-Planck equation becomes:

$$\part{p}{t} = -\nabla \cdot (p F) + \eta \Delta p,$$

as we claimed in the beginning of this post, where $\Delta = \nabla \cdot \nabla = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}$ is the Laplacian operator. In this case, the probability current becomes:

$$J = p F - \eta \nabla p = p \, (F - \eta \nabla \log p),$$
and we see that the effect of the stochastic differential $\sqrt{2\eta}dW$ is to modify the vector field from $F$ to $F - \eta \nabla \log p$, which now also depends on the evolving density $p$, reflecting the self-interaction property of stochastic dynamics.
Gibbs stationary distribution
Finally, consider what happens when the vector field $F$ is the negative gradient of a potential or objective function $f \colon \X \to \R$, so the SDE is:

$$dX = -\nabla f(X) \, dt + \sqrt{2\eta} \, dW,$$

which has the nice property that it always has a stationary distribution (as long as $Z$ below is finite), given by the Gibbs distribution:

$$p_\eta(x) = \frac{1}{Z} e^{-f(x)/\eta}$$

where $Z = \int_\X e^{-f(x)/\eta} dx$ is the normalization constant. This is because the choice of $p_\eta$ above satisfies $\nabla \log p_\eta = -\nabla f/\eta = F/\eta$, so the probability current vanishes:

$$J = p_\eta \, (F - \eta \nabla \log p_\eta) = p_\eta \, (F - F) = 0,$$

and thus $\part{p_\eta}{t} = -\nabla \cdot J = 0$, so $p_\eta$ is a time-independent (stationary) solution to the Fokker-Planck equation. This means that if we start with the initial density $p_\eta$, then it is always preserved under the flow of the SDE.
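We can also verify this numerically with the forward Euler (Euler-Maruyama) discretization of the SDE, which is the natural scheme under the Itô interpretation; the concrete example below ($f(x) = \frac{1}{2} x^2$ in $d = 1$, so the Gibbs distribution is $\mathcal{N}(0, \eta)$) is my own sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

# Langevin SDE: dX = -grad f(X) dt + sqrt(2*eta) dW, with f(x) = x^2 / 2,
# whose Gibbs (stationary) distribution is N(0, eta).
eta, dt, n_steps, n_particles = 0.25, 1e-2, 2_000, 10_000

grad_f = lambda x: x
X = np.zeros(n_particles)  # arbitrary initial density (a point mass at 0)
for _ in range(n_steps):
    X += -grad_f(X) * dt + np.sqrt(2 * eta * dt) * rng.normal(size=n_particles)

# After a long run, the empirical law should match the Gibbs distribution
print("empirical variance:", X.var(), " vs  eta =", eta)
```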
As remarked at the beginning of this post, we can show convergence to this stationary density, for example when $f$ is strongly convex. We can also interpret the stochastic gradient flow as the gradient flow on the space of probability measures, which we will discuss further next time.
But for now, let us note how remarkable this result is: the stationary measure $p_\eta$, which is the limiting distribution as $t \to \infty$, actually encodes the entire potential function $f$. In the deterministic case, as $\eta \to 0$, $p_\eta$ converges to a point mass $\delta_{x^\ast}$ at the minimizer $x^\ast = \arg\min_x f(x)$, which is the unique stationary point of the gradient flow dynamics $\dot X_t = -\nabla f(X_t)$ when $f$ is strictly convex.
We also note that it is still not at all clear (at least to me) from the SDE formulation $dX = -\nabla f(X) dt + \sqrt{2\eta} dW$ that $p_\eta$ is a stationary measure; it follows easily from the Fokker-Planck equation, as we saw above. But it would be nice to understand better how we can deduce the existence of the stationary measure $p_\eta$ directly from the SDE.