Last time we saw how the Fokker-Planck equation characterizes the evolution of a density $f$ under the flow of a stochastic differential equation. We also saw that for stochastic gradient flow, i.e., when the deterministic vector field is the (negative) gradient of a potential function $V \colon \X \to \R$:

$$dX = -\nabla V(X) dt + \sqrt{2\eta} dW$$

there is a stationary Gibbs distribution $f_\eta(x) \propto e^{-V(x)/\eta}$ which is a fixed point of the evolution equation.

It turns out stochastic gradient flow can be seen as gradient flow on the space of probability measures. I found this very striking and curiously beautiful when I first learned it from reading Villani's book, because it suggests that one way to remove (or account for) randomness is to lift ourselves from the base space to the space of measures, and in doing so we are back in the deterministic case again. So if we have a good theory or understanding of what happens for a general (deterministic) gradient flow, then we can translate some of those insights to understand stochastic gradient flow as well.

But first, let’s try to understand the basic setting. This follows Terry Tao’s blog post, particularly the beginning.


Heat equation and fundamental solution

Consider the simplest case when $V \equiv 0$, so the stochastic gradient flow above becomes:

$$dX = \sqrt{2\eta} dW$$

and the Fokker-Planck equation becomes the heat equation (usually $\eta = \frac{1}{2}$):

$$\part{f}{t} = \eta \Delta f$$

where $\Delta := \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}$ is the Laplacian.

In this case, we have an explicit fundamental solution for all $t > 0$:

$$\g(t,x) = \frac{1}{(4\pi \eta t)^{d/2}} e^{-\frac{|x|^2}{4\eta t}}$$

which describes the probability distribution of a (rescaled) Brownian motion starting at the origin at time $t = 0$, so at each time $t > 0$, $X_t \sim N(0, 2\eta t \I)$ has Gaussian distribution with mean $0$ and covariance matrix $2\eta t \I$. We can check this explicitly, since:

$$\part{\g}{t} = \frac{|x|^2 - 2\eta d t}{4 \eta t^2} \, \g(t,x)$$

and moreover:

$$\Delta \g = \frac{|x|^2 - 2\eta d t}{4 \eta^2 t^2} \, \g(t,x)$$

so we indeed have $\part{\g}{t} = \eta \Delta \g$. (Note that $d$ above is dimension, so $dt = d \times t$ is not the differential.)

So as $t \to \infty$, the distribution of $X_t \sim N(0, 2\eta t \I)$ becomes more and more diffuse, ultimately converging to the uniform distribution over the space $\X = \R^d$ (which is not a proper probability distribution, since it cannot be normalized).
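As a quick numerical sanity check (my own, not from Tao's post), we can verify in one dimension that this Gaussian density satisfies the heat equation, comparing finite-difference approximations of $\part{\g}{t}$ and $\eta \Delta \g$:

```python
import numpy as np

# Sanity check (1D; eta and t chosen arbitrarily): the Gaussian density
# g(t, x) = (4*pi*eta*t)^(-1/2) * exp(-x^2 / (4*eta*t)) of N(0, 2*eta*t)
# should satisfy the heat equation  dg/dt = eta * d^2 g / dx^2.
eta, t = 0.3, 1.7

def g(t, x):
    return (4 * np.pi * eta * t) ** -0.5 * np.exp(-x**2 / (4 * eta * t))

x = np.linspace(-3.0, 3.0, 101)
ht, hx = 1e-5, 1e-4  # finite-difference steps in t and x

dg_dt = (g(t + ht, x) - g(t - ht, x)) / (2 * ht)              # centered in t
lap_g = (g(t, x + hx) - 2 * g(t, x) + g(t, x - hx)) / hx**2   # centered in x

print(np.max(np.abs(dg_dt - eta * lap_g)))  # ~0 up to discretization error
```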


Heat equation is gradient flow of Dirichlet form

From Terry’s post, near the beginning:

``The heat equation can be viewed as the gradient flow for the Dirichlet form:

$$D(f,g) := \frac{1}{2} \int_\X \langle \nabla f, \nabla g \rangle dx$$

since one has the integration by parts identity:

$$\int_\X \langle \nabla f, \nabla g \rangle dx = -\int_\X (\Delta f) g \, dx$$

for all smooth, rapidly decreasing $f,g$, which formally implies that $\Delta f$ is the negative gradient of the Dirichlet energy $D(f,f) = \frac{1}{2}\int_\X |\nabla f|^2 dx$ with respect to the $L^2(\X,dx)$ inner product.’’

It always takes me a while to understand what this means, so here is my explanation. The relevant vector space here is not $\X = \R^d$ anymore, but $\F = L^2(\X,dx)$, the vector space of square-integrable functions $f \colon \X \to \R$ with $\int_\X |f(x)|^2 dx < \infty$, which is an inner product space with inner product:

$$\langle f, g \rangle := \int_\X f(x) g(x) dx$$

Then for an arbitrary functional $\E \colon \F \to \R$, we can define its gradient with respect to $f \in \F$ the usual way as “the vector of partial derivatives”, except that now $\F$ is an infinite-dimensional vector space (one dimension per $x \in \X$), so the gradient is no longer a vector, but a function $\partial \E(f) \equiv \part{\E}{f} \colon \X \to \R$, which has the property that it gives the best linear approximation:

$$\E(f + g) = \E(f) + \langle \partial \E(f), g \rangle + o(\|g\|)$$

for any base point $f \in \F$ and tangent vector $g \in \T_f\F \cong \F$ (since $\F$ is a vector space).
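To make this definition concrete, here is a toy example of my own: for $\E(f) = \frac{1}{2} \int_\X f^2 dx$ we have $\E(f+g) = \E(f) + \langle f, g \rangle + O(\|g\|^2)$, so $\partial \E(f) = f$. We can check this on a grid, where the integral becomes a weighted sum and the $L^2$ gradient is the vector of ordinary partial derivatives divided by the quadrature weight:

```python
import numpy as np

# Toy example (my own illustration): for E(f) = (1/2) * integral of f^2,
# the expansion E(f+g) = E(f) + <f, g> + O(||g||^2) says that the gradient
# with respect to the L^2 inner product <u, v> = integral of u*v is
# dE(f) = f itself.  On a grid, the integral becomes h * sum(...).
N = 100
h = 1.0 / N
x = np.linspace(0.0, 1.0, N, endpoint=False)
f = np.sin(2 * np.pi * x)

def E(f):
    return 0.5 * h * np.sum(f**2)

# Perturb one coordinate at a time: the ordinary partial derivative dE/df_j
# equals h * f_j, so dividing by the quadrature weight h recovers the
# L^2 gradient.
eps = 1e-6
grad = np.empty(N)
for j in range(N):
    e = np.zeros(N); e[j] = 1.0
    grad[j] = (E(f + eps * e) - E(f - eps * e)) / (2 * eps * h)

print(np.max(np.abs(grad - f)))  # ~0: the L^2 gradient of E is f
```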

Now take $\E(f)$ to be the Dirichlet energy:

$$\E(f) := \eta D(f,f) = \frac{\eta}{2} \int_\X |\nabla f|^2 dx$$

as defined above, but rescaled by $\eta > 0$. Then by expanding the square:

$$\E(f+g) = \frac{\eta}{2} \int_\X |\nabla f + \nabla g|^2 dx = \E(f) + \eta \int_\X \langle \nabla f, \nabla g \rangle dx + O(\|g\|^2) = \E(f) - \eta \int_\X (\Delta f) g \, dx + O(\|g\|^2)$$

where in the last step we have used the integration by parts identity $\int_\X \langle \nabla f, \nabla g \rangle dx = -\int_\X (\Delta f) g dx$.

Comparing this expansion with the definition of gradient above, we conclude that the Laplacian is indeed the negative gradient of the Dirichlet energy:

$$\partial \E(f) = -\eta \Delta f$$

and therefore, the heat equation $\part{f}{t} = \eta \Delta f$ is indeed the (negative) gradient flow of the Dirichlet form.
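Here is a small numerical check of this conclusion (my own sketch, on a periodic grid so that the boundary terms in the integration by parts vanish): the gradient of the discretized Dirichlet energy, computed by perturbing one coordinate at a time, should match $-\eta \Delta f$:

```python
import numpy as np

# Sketch (my own check): on a periodic grid, discretize the Dirichlet
# energy E(f) = (eta/2) * integral |f'|^2 and verify that its gradient
# with respect to the discrete L^2 inner product is -eta * (Laplacian f).
eta = 0.5
N = 64
h = 2 * np.pi / N
x = np.arange(N) * h
f = np.sin(x) + 0.3 * np.cos(2 * x)

def E(f):
    df = (np.roll(f, -1) - f) / h  # forward difference, periodic wrap
    return 0.5 * eta * h * np.sum(df**2)

# Numerical gradient with respect to the discrete inner product h * sum(u*v).
eps = 1e-6
grad = np.empty(N)
for j in range(N):
    e = np.zeros(N); e[j] = 1.0
    grad[j] = (E(f + eps * e) - E(f - eps * e)) / (2 * eps * h)

# Discrete Laplacian (periodic).
lap = (np.roll(f, -1) - 2 * f + np.roll(f, 1)) / h**2

print(np.max(np.abs(grad + eta * lap)))  # ~0: gradient of E is -eta * Lap f
```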


Backward Kolmogorov equation

Now consider the stochastic gradient flow:

$$dX = -\nabla V(X) dt + \sqrt{2\eta} dW$$

where $V \colon \X \to \R$ is a smooth potential function. As we saw last time, the Fokker-Planck equation is:

$$\part{f}{t} = \nabla \cdot (f \nabla V) + \eta \Delta f$$

which has a stationary solution, the Gibbs distribution $f_\eta(x) = \frac{1}{Z} e^{-V(x)/\eta}$, where $Z = \int_\X e^{-V(x)/\eta} dx$.

The Fokker-Planck equation above is also called the forward Kolmogorov equation, since it describes the evolution of the density $f$ with respect to the Lebesgue measure $dx$. But we can also consider the evolution of the density $g = f/f_\eta = f e^{V(x)/\eta} Z$ with respect to the stationary measure $d\mu = f_\eta(x) dx$. The resulting equation is called the backward Kolmogorov equation.

To proceed, we plug $f = g f_\eta = \frac{1}{Z} g e^{-V(x)/\eta}$ into the forward Kolmogorov equation above. Since $\nabla f_\eta = -\frac{1}{\eta} f_\eta \nabla V$, we have:

$$\nabla f = f_\eta \nabla g - \frac{1}{\eta} g f_\eta \nabla V$$

and:

$$\Delta f = f_\eta \Delta g - \frac{2}{\eta} f_\eta \langle \nabla V, \nabla g \rangle - \frac{1}{\eta} g f_\eta \Delta V + \frac{1}{\eta^2} g f_\eta |\nabla V|^2$$

Then:

$$\nabla \cdot (f \nabla V) = \langle \nabla f, \nabla V \rangle + f \Delta V = f_\eta \langle \nabla g, \nabla V \rangle - \frac{1}{\eta} g f_\eta |\nabla V|^2 + g f_\eta \Delta V$$

Therefore, the right hand side of the Fokker-Planck equation becomes:

$$\nabla \cdot (f \nabla V) + \eta \Delta f = f_\eta \left( \eta \Delta g - \langle \nabla V, \nabla g \rangle \right)$$

Thus, since $\part{f}{t} = f_\eta \part{g}{t}$, the Fokker-Planck equation $\part{f}{t} = \nabla \cdot(f \nabla V) + \eta \Delta f$ yields the evolution equation for $g$:

$$\part{g}{t} = \eta \Delta g - \langle \nabla V, \nabla g \rangle$$

which is the backward Kolmogorov equation. Note that $g = 1$ is a solution, which reflects the fact that $f = 1 \cdot f_\eta = f_\eta$ satisfies the Fokker-Planck equation.
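The computation above can also be verified symbolically. Here is a sketch in one dimension using sympy, with $V$ left as a generic smooth function (my own sanity check, not part of the original derivation):

```python
import sympy as sp

# Symbolic check of the derivation in 1D.
x, eta, Z = sp.symbols('x eta Z', positive=True)
V = sp.Function('V')(x)
g = sp.Function('g')(x)

f_eta = sp.exp(-V / eta) / Z      # Gibbs density
f = g * f_eta                     # plug in f = g * f_eta

# Right hand side of the Fokker-Planck equation: (f V')' + eta f''
rhs = sp.diff(f * sp.diff(V, x), x) + eta * sp.diff(f, x, 2)

# Expected: f_eta * (eta g'' - V' g'), the backward Kolmogorov operator
expected = f_eta * (eta * sp.diff(g, x, 2) - sp.diff(V, x) * sp.diff(g, x))

print(sp.simplify(rhs - expected))  # 0
```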


Stochastic gradient flow is gradient flow of Dirichlet form

Let $V \colon \X \to \R$ be a smooth potential function as before with $Z = \int_\X e^{-V(x)/\eta} dx < \infty$, and consider using the base measure:

$$d\mu = f_\eta(x) dx = \frac{1}{Z} e^{-V(x)/\eta} dx$$

which is the stationary measure of the stochastic gradient flow $dX = -\nabla V(X) dt + \sqrt{2\eta} dW$.
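As an empirical illustration (my own, with an arbitrarily chosen quadratic potential), we can simulate this SDE with the Euler-Maruyama scheme and check that the samples settle into the Gibbs distribution: for $V(x) = x^2/2$ the Gibbs distribution is $N(0, \eta)$, so the long-run sample variance should be close to $\eta$:

```python
import numpy as np

# Empirical check: simulate dX = -V'(X) dt + sqrt(2*eta) dW with
# V(x) = x^2 / 2 via Euler-Maruyama.  The Gibbs distribution is then
# proportional to exp(-x^2 / (2*eta)), i.e. N(0, eta), so the long-run
# sample variance should be close to eta.
rng = np.random.default_rng(0)
eta = 0.25
dt = 1e-3
n_paths, n_steps = 5000, 5000

X = np.zeros(n_paths)  # start all paths at the origin
for _ in range(n_steps):
    X += -X * dt + np.sqrt(2 * eta * dt) * rng.standard_normal(n_paths)

print(X.var())  # ~ eta = 0.25
```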

Let $\F = L^2(\X,d\mu)$ be the vector space of functions $g \colon \X \to \R$ that are square-integrable with respect to $d\mu$, i.e., $\int_\X g^2 d\mu < \infty$, which is an inner product space with inner product:

$$\langle g, h \rangle_\mu := \int_\X g(x) h(x) d\mu$$

Then for $g,h \in \F$, we can define the Dirichlet form with respect to $d\mu$:

$$D_\mu(g,h) := \frac{1}{2} \int_\X \langle \nabla g, \nabla h \rangle d\mu$$

where $\langle \cdot, \cdot \rangle$ is still the usual $\ell_2$ inner product on $\R^d$. Similarly, we define the Dirichlet energy:

$$\E(g) := \eta D_\mu(g,g) = \frac{\eta}{2} \int_\X |\nabla g|^2 d\mu$$

which is our functional of interest.

Let us now calculate the gradient $\partial \E(g)$ from the expansion:

$$\E(g + h) = \E(g) + \langle \partial \E(g), h \rangle_\mu + o(\|h\|_\mu)$$

By expanding the square:

$$\E(g+h) = \E(g) + \eta \int_\X \langle \nabla g, \nabla h \rangle d\mu + O(\|h\|_\mu^2)$$

And by integration by parts, now taking into account the base measure $d\mu$:

$$\int_\X \langle \nabla g, \nabla h \rangle d\mu = -\int_\X \left( \Delta g - \frac{1}{\eta} \langle \nabla V, \nabla g \rangle \right) h \, d\mu$$

Thus, we find that the gradient of the Dirichlet energy with respect to the $L^2(\X,d\mu)$ inner product structure is given by:

$$\partial \E(g) = -\left( \eta \Delta g - \langle \nabla V, \nabla g \rangle \right)$$

Therefore, the (negative) gradient flow of the Dirichlet energy is given by the equation:

$$\part{g}{t} = -\partial \E(g) = \eta \Delta g - \langle \nabla V, \nabla g \rangle$$

which we recognize as the backward Kolmogorov equation, as derived in the previous section.
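To close the loop, here is a discrete check of this gradient computation (my own construction: I work on the circle $[0, 2\pi)$ with a toy potential $V(x) = \cos x$, so that $Z$ is finite and there are no boundary terms). The numerical gradient of the discretized Dirichlet energy, taken with respect to the weighted inner product, should approximate $-(\eta \Delta g - \langle \nabla V, \nabla g \rangle)$:

```python
import numpy as np

# Discrete sanity check: on the circle [0, 2*pi) with the toy potential
# V(x) = cos(x), the gradient of E(g) = (eta/2) * integral |g'|^2 dmu
# in the L^2(dmu) inner product should approximate -(eta * g'' - V' * g').
eta = 1.0
N = 1000
h = 2 * np.pi / N
x = np.arange(N) * h
g = np.sin(x)
w = np.exp(-np.cos(x) / eta)         # unnormalized density of dmu (Z cancels)
w_half = 0.5 * (w + np.roll(w, -1))  # weight at midpoints i + 1/2

def E(g):
    dg = (np.roll(g, -1) - g) / h    # forward difference, periodic wrap
    return 0.5 * eta * h * np.sum(dg**2 * w_half)

# Numerical gradient with respect to the weighted inner product
# <u, v>_mu = h * sum(u * v * w).
eps = 1e-6
grad = np.empty(N)
for j in range(N):
    e = np.zeros(N); e[j] = 1.0
    grad[j] = (E(g + eps * e) - E(g - eps * e)) / (2 * eps * h * w[j])

# Analytic target: -(eta * g'' - V' * g') with g = sin, V = cos.
target = -(eta * (-np.sin(x)) - (-np.sin(x)) * np.cos(x))

print(np.max(np.abs(grad - target)))  # small, O(h^2) discretization error
```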

Thus, we conclude that the stochastic gradient flow, in particular the backward Kolmogorov equation, is the gradient flow for the Dirichlet energy. This gives many nice properties, and under some additional assumptions also yields convergence bounds, as we shall explore further next time.