Let $\X = \R^d$. A dynamics on $\X$ is defined to be the integral curve $X = (X(t))_{0 \le t < \infty}$ of a vector field $\v \colon \X \to \R^d$, which means at each time $t \ge 0$, the velocity of $X$ is given by the vector field $\v$ at its current position:
$$\dot X(t) = \v(X(t)).$$
But what does this definition mean operationally (or algorithmically)? If at some time $t \ge 0$ we are at position $X(t) \in \X$ and know our velocity $\dot X(t) \in \R^d$, how do we decide where to go next?

The problem is that in continuous time, there is no “next” time, since there is always an infinity of possible times between now and any future time. The best we can say is that knowing the velocity $\dot X(t)$ tells us the approximate value of the position $X(t+\delta)$ for an infinitesimal value of $\delta$:
$$X(t+\delta) \approx X(t) + \delta \, \dot X(t) = X(t) + \delta \, \v(X(t)).$$
But in practice, we want to be able to simulate the dynamics as a discrete-time algorithm (that we can implement in a computer). We do so by discretizing time into a sequence of discrete points, and designing an algorithm that tries to follow the dynamics by going from one time point to the next. We hope that if the discretization is fine enough, then the discrete-time algorithm captures the behavior of the continuous-time dynamics well. For example, in the context of optimization, we want the discrete-time algorithm to exhibit a convergence rate matching that of the continuous-time dynamics.


Forward and backward Euler

There are two distinct, yet subtly related, schemes that are the most “generic” ways of implementing the dynamics generated by a vector field. Both are based on the idea of incrementing time by a discrete time step $\delta > 0$ and using a first-order approximation (in time) to the path.

The first scheme is forward Euler, which approximates the velocity $\dot X(t)$ by the finite difference $\frac{x_{k+1} - x_k}{\delta}$, and evaluates the vector field $\v(X(t))$ at the current position $\v(x_k)$, so:
$$\frac{x_{k+1} - x_k}{\delta} = \v(x_k).$$
This yields an explicit update step from $x_k$ to $x_{k+1}$:
$$x_{k+1} = x_k + \delta \, \v(x_k),$$
which we can also write as:
$$x_{k+1} = \F_{\delta,\v}(x_k),$$
where the forward Euler operator $\F_{\delta,\v} \colon \X \to \X$ is defined by:
$$\F_{\delta,\v} = \I + \delta \, \v.$$
Note that $\F_{0,\v} = \I$ is the identity. We assume $\F_{\delta,\v}$ is a diffeomorphism on $\X$ for small enough $\delta$.
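As a concrete illustration (not from the text above), here is a minimal sketch of forward Euler in Python, on the hypothetical linear vector field $\v(x) = -x$, whose exact flow is $X(t) = e^{-t} X(0)$:

```python
import numpy as np

def forward_euler_step(x, v, delta):
    """One forward Euler step: x_{k+1} = x_k + delta * v(x_k)."""
    return x + delta * v(x)

# Hypothetical example: v(x) = -x, whose exact flow is X(t) = exp(-t) X(0).
v = lambda x: -x
delta, k = 0.01, 100          # simulate up to t = delta * k = 1
x = np.array([1.0])
for _ in range(k):
    x = forward_euler_step(x, v, delta)

print(x[0])                   # close to exp(-1)
```

With $\delta = 0.01$ the iterate after $k = 100$ steps is $(1 - \delta)^{100} \approx 0.366$, close to the exact value $e^{-1} \approx 0.368$.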

The second scheme is backward Euler, which still approximates the velocity $\dot X(t)$ by the finite difference $\frac{x_{k+1}-x_k}{\delta}$, but now evaluates the vector field $\v(X(t))$ at the future position $\v(x_{k+1})$, so:
$$\frac{x_{k+1} - x_k}{\delta} = \v(x_{k+1}).$$
Unlike forward Euler above, we now have an implicit equation that we need to solve for $x_{k+1}$:
$$x_{k+1} = x_k + \delta \, \v(x_{k+1}).$$
Collecting the terms involving $x_{k+1}$, we get:
$$(\I - \delta \, \v)(x_{k+1}) = x_k,$$
so that we can write the backward Euler update as:
$$x_{k+1} = \B_{\delta,\v}(x_k),$$
where the backward Euler operator $\B_{\delta,\v} \colon \X \to \X$ is given by:
$$\B_{\delta,\v} = (\I - \delta \, \v)^{-1} = (\F_{-\delta,\v})^{-1}.$$
Thus, the forward and backward Euler methods are adjoint to each other: backward Euler is the inverse of forward Euler with the time step negated.

The advantage of forward Euler is that it gives an explicit update equation, so it is easy to implement in practice. Backward Euler, on the other hand, requires solving an implicit equation, so it is more expensive per step, but in general it has better stability properties.
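To see the stability difference, consider the hypothetical stiff linear field $\v(x) = -\lambda x$ with $\lambda = 100$; here the implicit backward Euler equation can be solved in closed form, $x_{k+1} = x_k/(1 + \delta\lambda)$. With a step $\delta$ such that $\delta\lambda > 2$, forward Euler blows up while backward Euler still converges:

```python
# Hypothetical stiff linear field v(x) = -lam * x with a large step delta,
# so that delta * lam = 10 lies outside forward Euler's stability region
# |1 - delta * lam| <= 1.
lam, delta = 100.0, 0.1
xf = xb = 1.0
for _ in range(10):
    xf = xf + delta * (-lam * xf)   # forward Euler: multiplies by 1 - 10 = -9
    xb = xb / (1.0 + delta * lam)   # backward Euler: multiplies by 1/11

print(abs(xf), abs(xb))  # forward has blown up; backward has decayed
```

After 10 steps the forward iterate has magnitude $9^{10} \approx 3.5 \times 10^9$, while the backward iterate has magnitude $11^{-10} \approx 4 \times 10^{-11}$.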

For small $\delta$, forward and backward Euler are almost the same, because we have the approximation:
$$\F_{\delta,\v} = \I + \delta \, \v \approx \exp(\delta \, \v),$$
and similarly:
$$\B_{\delta,\v} = (\I - \delta \, \v)^{-1} \approx \exp(\delta \, \v),$$
where $\exp$ should be understood via its Taylor series expansion.

This also extends to the first $k$ steps of forward and backward Euler: as long as $\delta k$ is not too large, we still have the approximation:
$$\F_{\delta,\v}^{\,k} = (\I + \delta \, \v)^k \approx \exp(\delta k \, \v),$$
and similarly:
$$\B_{\delta,\v}^{\,k} = (\I - \delta \, \v)^{-k} \approx \exp(\delta k \, \v).$$
This also agrees with what we saw last time, namely that integrating the vector field $\v$ amounts to taking the exponential map, $X(t) = T_t(X(0))$ where $T_t = \exp(t \v)$. Here the continuous time $t$ is related to the discrete time $k$ via the relation
$$t = \delta k,$$
since each step in discrete time corresponds to an elapsed time of $\delta$ in continuous time.
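For a linear vector field $\v(x) = Ax$ these approximations can be checked directly, since $k$ steps of forward Euler give the matrix power $(\I + \delta A)^k$, which should approach $\exp(tA)$ with $t = \delta k$. A small numerical sketch (the matrix $A$ here is a hypothetical example, the generator of rotations in the plane):

```python
import numpy as np

# Hypothetical linear field v(x) = A x with A the rotation generator,
# for which exp(t A) is the rotation matrix by angle t (since A^2 = -I).
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
delta, k = 0.001, 1000                  # t = delta * k = 1
Fk = np.linalg.matrix_power(np.eye(2) + delta * A, k)   # k forward Euler steps
exact = np.array([[np.cos(1.0), np.sin(1.0)],
                  [-np.sin(1.0), np.cos(1.0)]])          # exp(1 * A)

print(np.max(np.abs(Fk - exact)))       # small discretization error
```

With $\delta = 10^{-3}$ the entrywise error is on the order of $10^{-3}$ or smaller, and it shrinks as $\delta \to 0$ with $t = \delta k$ held fixed.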


Optimizer vector field

Among all vector fields on $\X$, there is a special class of optimizer vector fields, which arise from optimizing some (regularized) objective function. These vector fields are of particular interest to us because they can be used to design dynamics for optimization. The main example we have in mind is the gradient flow dynamics and its generalizations, including the natural gradient flow and the rescaled gradient flow. For this class of optimizer vector fields, we will see that the forward and backward Euler operators take nice forms.

Concretely, let
$$f \colon \X \to \R$$
be the objective function we want to minimize. We assume $f$ is convex and differentiable with a unique minimizer $x^\ast = \arg\min_x f(x)$, characterized by $\nabla f(x^\ast) = 0$. Let
$$\phi \colon \R^d \to \R$$
be a regularizer. We assume $\phi$ is convex, differentiable, and symmetric, which means
$$\phi(-v) = \phi(v) \qquad \text{for all } v \in \R^d.$$
Consequently, since $\phi$ is convex,
$$\phi(0) = \phi\left(\frac{v + (-v)}{2}\right) \le \frac{1}{2}\,\phi(v) + \frac{1}{2}\,\phi(-v) = \phi(v),$$
which means $0 \in \arg\min_v \phi(v)$. We assume $0$ is the unique minimizer of $\phi$.

Let $\phi^\ast \colon \R^d \to \R$ be the dual function (or convex conjugate) of $\phi$, which is defined by
$$\phi^\ast(s) = \sup_{v \in \R^d} \left\{ \langle s, v \rangle - \phi(v) \right\}.$$
The supremum above is achieved at $v$ if $s = \nabla \phi(v)$, or equivalently, at $v = \nabla \phi^\ast(s)$. Thus, $\nabla \phi$ and $\nabla \phi^\ast$ are inverses to each other.

Now we define the optimizer vector field $\M_{f,\phi} \colon \X \to \R^d$ by
$$\M_{f,\phi}(x) = \arg\min_{v \in \R^d} \left\{ \langle \nabla f(x), v \rangle + \phi(v) \right\}.$$
The optimality condition for the optimization problem above is $\nabla f(x) + \nabla \phi(v) = 0$, or equivalently, $v = \nabla \phi^\ast(-\nabla f(x))$. Since $\phi$ is symmetric, $\phi^\ast$ is symmetric as well, so $\nabla \phi^\ast(-s) = -\nabla \phi^\ast(s)$. Thus, we can write the optimizer vector field as:
$$\M_{f,\phi}(x) = -\nabla \phi^\ast(\nabla f(x)),$$
or more succinctly, $\M_{f,\phi} = -\nabla \phi^\ast \circ \nabla f$.
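For example (a standard sanity check), take the quadratic regularizer $\phi(v) = \frac{1}{2}|v|^2$. Then $\phi^\ast(s) = \frac{1}{2}|s|^2$ and $\nabla \phi^\ast = \I$, so

$$\M_{f,\phi}(x) = -\nabla \phi^\ast(\nabla f(x)) = -\nabla f(x),$$

and the corresponding dynamics is the gradient flow $\dot X(t) = -\nabla f(X(t))$.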


Gradient and proximal steps

If $\v = \M_{f,\phi}$ is an optimizer vector field, then its forward and backward Euler operators have nice explicit representations, given by the gradient step and proximal step, respectively. This is classical: for instance, the forward Euler method for gradient flow (the case $\phi(v) = \frac{1}{2}|v|^2$) is the gradient descent algorithm, whereas the backward Euler method is the proximal point algorithm. But in this more general setting, with an arbitrary regularizer $\phi$, we see the interesting behavior that the scale $\delta$ appears in the perspective function of $\phi$.

Concretely, for $\v = \M_{f,\phi} = -\nabla \phi^\ast \circ \nabla f$ we can write the forward Euler operator as:
$$\F_{\delta,\v}(x) = \arg\min_{y \in \X} \left\{ f(x) + \langle \nabla f(x),\, y - x \rangle + \delta\,\phi\!\left(\frac{y-x}{\delta}\right) \right\}.$$
Indeed, by differentiating the objective function on the right-hand side above with respect to $y$, we find that the optimality condition is $\nabla f(x) + \nabla\phi(\frac{y-x}{\delta}) = 0$, or equivalently $y = x - \delta \nabla \phi^\ast(\nabla f(x))$, as desired. We can also write this update in a form closer to the optimizer vector field:
$$\F_{\delta,\v} = \I + \M_{f,\phi,\delta},$$
where $\M_{f,\phi,\delta} \colon \X \to \R^d$ is the optimizer vector field at scale $\delta$:
$$\M_{f,\phi,\delta}(x) = \arg\min_{v \in \R^d} \left\{ \langle \nabla f(x),\, v \rangle + \delta\,\phi\!\left(\frac{v}{\delta}\right) \right\}.$$
In particular, notice that the optimizer vector field in continuous time is the case $\delta = 1$, $\M_{f,\phi} = \M_{f,\phi,1}$. But in discrete time, the scale $\delta$ modifies the regularizer $\phi$ into the perspective function $\delta \phi(\frac{v}{\delta})$.
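In the quadratic case $\phi(v) = \frac{1}{2}|v|^2$, where $\nabla\phi^\ast = \I$, the forward Euler update is plain gradient descent, $x_{k+1} = x_k - \delta \, \nabla f(x_k)$. A minimal sketch on a hypothetical quadratic objective $f(x) = \frac{1}{2} x^\top A x$:

```python
import numpy as np

# Hypothetical quadratic objective f(x) = 0.5 * x^T A x, minimized at x* = 0.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x

delta = 0.05              # step size; needs delta < 2 / lambda_max(A) = 0.2
x = np.array([1.0, 1.0])
for _ in range(200):
    x = x - delta * grad_f(x)     # forward Euler step = gradient descent step

print(np.linalg.norm(x))          # close to the minimizer x* = 0
```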

Similarly, still for the optimizer vector field $\v = \M_{f,\phi}$, we can write the backward Euler operator as:
$$\B_{\delta,\v}(x) = \arg\min_{y \in \X} \left\{ f(y) + \delta\,\phi\!\left(\frac{y-x}{\delta}\right) \right\}.$$
Indeed, the optimality condition for the minimization problem above is $\nabla f(y) + \nabla \phi(\frac{y-x}{\delta}) = 0$, or equivalently $x = y + \delta \nabla \phi^\ast(\nabla f(y)) = (\I + \delta \nabla \phi^\ast \circ \nabla f)(y) = \F_{-\delta,\v}(y)$, so $y = (\F_{-\delta,\v})^{-1}(x) = \B_{\delta,\v}(x)$, as desired. Here $\delta$ again appears in the perspective function of $\phi$. But notice the difference from forward Euler: here we are minimizing the actual objective function $f$, regularized by the perspective function of $\phi$, whereas in forward Euler above we are minimizing a first-order approximation to $f$. We can also try to write it in vector field form:
$$\B_{\delta,\v} = \I + \tilde \M_{f,\phi,\delta},$$
but now $\tilde \M_{f,\phi,\delta} \colon \X \to \R^d$ is given by:
$$\tilde \M_{f,\phi,\delta}(x) = \arg\min_{v \in \R^d} \left\{ f(x + v) + \delta\,\phi\!\left(\frac{v}{\delta}\right) \right\},$$
so, unlike in the forward Euler case above, it is no longer simply related to the optimizer vector field $\M_{f,\phi}$.
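Again in the quadratic case $\phi(v) = \frac{1}{2}|v|^2$, the backward Euler update is the proximal point step $x_{k+1} = \arg\min_y \{ f(y) + \frac{1}{2\delta}|y - x_k|^2 \}$, which for a hypothetical quadratic $f(x) = \frac{1}{2} x^\top A x$ has the closed form $x_{k+1} = (\I + \delta A)^{-1} x_k$:

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 * x^T A x: the proximal (backward Euler)
# step argmin_y { f(y) + |y - x|^2 / (2*delta) } solves to (I + delta*A)^{-1} x.
A = np.diag([1.0, 10.0])
delta = 0.05
prox = np.linalg.inv(np.eye(2) + delta * A)

x = np.array([1.0, 1.0])
for _ in range(200):
    x = prox @ x                  # one proximal point step

print(np.linalg.norm(x))          # close to x* = 0; stable for any delta > 0
```

Unlike the gradient step above, this iteration contracts toward $x^\ast$ for every $\delta > 0$, reflecting the stability of backward Euler.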


On the scale of $\delta$

What is interesting from the consideration above is that it shows how the scale of the time step $\delta$ appears in the discrete-time algorithm, depending on the homogeneity property of the regularizer $\phi$.

Concretely, suppose $\phi$ is homogeneous of order $p \ge 1$, which means
$$\phi(c\,v) = c^p\,\phi(v) \qquad \text{for all } c > 0 \text{ and } v \in \R^d.$$
Then the perspective function in either the forward or backward Euler method above becomes:
$$\delta\,\phi\!\left(\frac{v}{\delta}\right) = \delta^{1-p}\,\phi(v) = \frac{1}{\delta^{p-1}}\,\phi(v).$$
Thus, we can write the forward Euler method as:
$$x_{k+1} = \arg\min_{y \in \X} \left\{ f(x_k) + \langle \nabla f(x_k),\, y - x_k \rangle + \frac{1}{\epsilon}\,\phi(y - x_k) \right\},$$
while we can write the backward Euler method as:
$$x_{k+1} = \arg\min_{y \in \X} \left\{ f(y) + \frac{1}{\epsilon}\,\phi(y - x_k) \right\},$$
where in both algorithms, the step size $\epsilon$ is related to the time step $\delta$ via:
$$\epsilon = \delta^{p-1}.$$
For example, consider the $p$-th power of the $\ell_2$-norm regularizer:
$$\phi(v) = \frac{1}{p}\,|v|^p,$$
which is homogeneous of order $p$. The dynamics with respect to the optimizer vector field for this function is the rescaled gradient flow:
$$\dot X(t) = -\frac{\nabla f(X(t))}{|\nabla f(X(t))|^{\frac{p-2}{p-1}}},$$
and the scaling $\epsilon = \delta^{p-1}$ above is consistent with what we found in analyzing the continuous-time limit of the $p$-th order gradient method (which uses the $(p-1)^{\text{st}}$-order Taylor approximation to $f$, rather than just the first-order approximation as above), see Section 3.3 here. Moreover, the scaling $\epsilon = \delta^{p-1}$ is also necessary to derive matching convergence rates, of $O(1/t^{p-1})$ in continuous time and $O(1/(\epsilon k^{p-1}))$ in discrete time, since $t = \delta k$, so $t^{p-1} = \delta^{p-1} k^{p-1} = \epsilon k^{p-1}$.
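As a sketch of how this scaling plays out, here is forward Euler for the rescaled gradient flow with $p = 4$ on a hypothetical objective $f(x) = \frac{1}{2}|x|^2$, with step size $\epsilon = \delta^{p-1}$:

```python
import numpy as np

# Forward Euler for the rescaled gradient flow with p = 4:
#   x_{k+1} = x_k - eps * grad_f(x_k) / |grad_f(x_k)|^{(p-2)/(p-1)},
# with step size eps = delta**(p-1), as in the scaling derived above.
p, delta = 4, 0.1
eps = delta ** (p - 1)            # eps = 0.001
grad_f = lambda x: x              # hypothetical f(x) = 0.5 * |x|^2

x = np.array([1.0, 0.0])
for _ in range(2000):
    g = grad_f(x)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        break                     # already at the minimizer
    x = x - eps * g / gnorm ** ((p - 2) / (p - 1))

print(np.linalg.norm(x))          # small: near the minimizer x* = 0
```

Since the step is normalized by $|\nabla f|^{(p-2)/(p-1)}$, the iterates approach $x^\ast = 0$ and then hover at a scale set by $\epsilon$, illustrating the $O(1/(\epsilon k^{p-1}))$ behavior rather than proving it.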

I think this suggests there is a more fundamental “calculus” for determining how the time step $\delta$ in the discretization interacts with the step size $\epsilon$ in the algorithm (as I tried to illustrate in Section 2.3 here).