Gradient flow and gradient descent

$\def \R {\mathbb R} \def \X {\mathcal X} \def \N {\mathbb N} \def \Z {\mathbb Z} \def \A {\mathcal A} \def \E {\mathcal E}$ In the beginning there is Time. Time moves, and creates dynamics on Space.

Continuous and discrete time

In continuous time, time flows continuously at a constant unit rate:

\[t \in \R_0 = [0, +\infty)\]

whereas in discrete time, time ticks at a discrete time step:

\[k \in \Z_0 = \{0, 1, 2, \dots \}\]

We can also view discrete time as a discretization of continuous time at some time step $\delta > 0$:

\[t = \delta k\]

and we typically think of the time step $\delta$ to be small (infinitesimal, so low-order approximations suffice).

Dynamics in continuous time = algorithms in discrete time

Let $\X$ be space. We assume $\X = \R^d$ for simplicity, but more generally $\X$ is a complete manifold.

A dynamics on $\X$ is specified via a differential equation:

\[\dot X_t = F(X_t)\]

where $F \colon \X \to \R^d$ is a vector field that assigns the velocity vector $F(x)$ to each point $x$ in space, which specifies where the point should go next (in the “instantaneous future”).

So a dynamics $X = (X_t)$, $t \in \R_0$, starts at some initial position $X_0 \in \X$ at time $t = 0$ and flows following the differential equation above, that is, by setting the velocity $\dot X_t$ at each time $t$ equal to the value of the vector field $F$ at the current position $X_t$.

But concretely, if at time $t$ we are at point $X_t$ and know our velocity is $\dot X_t$, where should we go next? The problem is there is no “next” in continuous time. Instead, the instantaneous velocity vector only tells us where we will approximately be in the near future:

\[X_{t+\delta} = X_t + \delta \dot X_t + O(\delta^2)\]

So if at time $t$ our velocity is $\dot X_t = F(X_t)$, then our best guess for where we should be at time $t+\delta$ is

\[X_{t+\delta} = X_t + \delta F(X_t)\]

that is, we scale the velocity vector by $\delta$.

This gives us a way to simulate (or discretize) the dynamics, by starting at some initial position $X_0 \in \X$ at time $t = 0$ and using the approximation above every $\delta$ time step, which should work well for small $\delta$.

So if we define the sequence

\[x_k := X_{\delta k}\]

then we obtain an algorithm in discrete time that is defined iteratively via the update rule

\[x_{k+1} = x_k + \delta F(x_k)\]

Thus, we can view algorithms in discrete time as a discretization of dynamics in continuous time, or dynamics as the continuous-time limit ($\delta \to 0$) of algorithms. Equivalently, dynamics in continuous time are templates that we can implement as algorithms in discrete time (which is what we ultimately care about) via some systematic discretization methods that maintain some provable guarantees.

Gradient flow and gradient descent

The prototypical example we have in mind is the gradient flow dynamics in continuous time:

\[\dot X_t = -\nabla f(X_t)\]

and the corresponding gradient descent algorithm in discrete time:

\[x_{k+1} = x_k - \epsilon \nabla f(x_k)\]

where we recall from last time that $\;f \colon \X \to \R$ is a convex objective function we wish to minimize. Note that the step size $\epsilon > 0$ here plays the same role as the time step $\delta$ from above, so we have $\epsilon = \delta$.

Gradient flow and gradient descent are generated by the negative gradient vector field:

\[F(x) = -\nabla f(x)\]

which is the direction of steepest descent, i.e., we can write it as the solution to the following optimization problem using a regularized linear approximation to the objective function:

\[-\nabla f(x) = \arg\min_{v \in \R^d} \;\left\{ \langle \nabla f(x), v \rangle + \frac{1}{2} \|v\|^2 \right\}\]

Thus, gradient flow is a greedy method in continuous time that flows following the (locally-optimal) steepest descent direction. In particular, gradient flow is a descent method, which means it always decreases the objective function:

\[\frac{d}{dt} \; f(X_t) = \langle \nabla f(X_t), \dot X_t \rangle = -\|\nabla f(X_t)\|^2 \le 0\]

This is a natural property for optimization: Since our goal is to minimize $f$, it makes sense to decrease $f$ as much as possible at each time. Thus, gradient flow is a simple and intuitive method for convex optimization in continuous time.

Similarly, gradient descent is a greedy algorithm in discrete time:

\[x_{k+1} = x_k - \epsilon \nabla f(x_k) = \arg\min_{x \in \X} \;\left\{ f(x_k) + \langle \nabla f(x_k), x-x_k \rangle + \frac{1}{2 \epsilon} \|x-x_k\|^2 \right\}\]

Note how the step size $\epsilon > 0$ becomes the weight $1/\epsilon$ multiplying the quadratic regularization term (equivalently, multiplying the gradient $\nabla f(x_k)$ by a factor of $\epsilon$).

Gradient descent is also a descent method, but (in contrast to continuous time) we need an additional assumption that the objective function $f$ be $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz, or $|\nabla^2 f(x)| \le L$) and the step size $\epsilon$ be less than $2/L$; see for example (1.2.12) of Nesterov’s book. Moreover, the optimal step size choice is $\epsilon = 1/L$, which we use.

So in discrete time we make the additional assumption that $f$ is $(1/\epsilon)$-smooth.

As we shall see, this is a pattern that persists to a great generality: Optimization in continuous and discrete time exhibit nice matching parallel structures (such as convergence rates), but the proofs in discrete time always require additional assumptions (such as smoothness). These assumptions are the “price to pay” to implement the dynamics from continuous to discrete time while maintaining the convergence guarantees in the resulting algorithms.

We illustrate this below with the simplest example of gradient flow and gradient descent.

Remark: The discretization we use here is the forward-Euler (explicit) method, which yields the gradient descent algorithm. Another common discretization is the backward-Euler (implicit) method, which yields the proximal gradient algorithm; see for example Section 4.1 of this monograph by Parikh and Boyd.

Matching convergence rates: Smooth case

In continuous time, gradient flow is a descent method, so $f(X_t)$ always decreases and converges. When $f$ is convex (and differentiable), it turns out we can prove an explicit convergence rate.

Theorem: Whenever $f$ is convex, gradient flow has convergence rate
\[f(X_t) - f(x^*) \le \frac{\|X_0-x^*\|^2}{t} = O\left(\frac{1}{t}\right)\]

Proof: Consider the “energy functional”

\[\E_t \equiv \E(X_t) := \frac{1}{f(X_t) - f(x^*)}\]

Under the gradient flow dynamics $\dot X_t = -\nabla f(X_t)$, the rate of change of $\E_t$ is

\[\dot{\E}_t := \frac{d}{dt} \; \E_t = \left(\frac{\|\nabla f(X_t)\|}{f(X_t) - f(x^*)}\right)^2 \ge \frac{1}{\|X_0-x^*\|^2}\]

where the inequality follows from the convexity of $f$ and Cauchy-Schwarz inequality, and the fact that gradient flow is a descent method. Thus, the energy $\E_t$ is increasing at least linearly, which implies the claim from the definition of $\E_t$. $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad\square$

The proof above matches the discrete-time proof below, and it will generalize to rescaled gradient flow later. There are many other proofs, including the following, which will generalize to natural gradient flow.

Alternate proof: Show that the following alternate energy functional

\[\tilde \E_t = t(f(X_t) - f(x^*)) + \frac{1}{2} \|X_t - x^*\|^2\]

is a decreasing function of time (i.e., $\dot{\tilde \E}_t \le 0$, so $\tilde \E_t$ is a Lyapunov function). $\qquad\qquad\qquad\qquad\square$

In discrete time, we can prove a matching convergence rate for gradient descent under an additional smoothness assumption.

Theorem: If $f$ is convex and $(1/\epsilon)$-smooth, then gradient descent has convergence rate
\[f(x_k) - f(x^*) \le \frac{4\|x_0-x^*\|^2}{\epsilon k} = O\left(\frac{1}{\epsilon k}\right)\]

Proof: As in the continuous-time proof, the key idea is to show that the “energy functional”

\[\E_k \equiv \E(x_k) := \frac{1}{\Delta_k}\]

where $\Delta_k := f(x_k) - f(x^\ast)$ is the function value residual, is increasing at least linearly, that is:

\[\E_{k+1} \ge \E_k + \frac{\epsilon}{4\|x_0-x^*\|^2}\]

However, to prove this we need to show a finer guarantee on the decrease of the residual:

\[\Delta_{k+1} \le \Delta_k - \frac{\epsilon \Delta_k^2}{4 \|x_0-x^*\|^2}\]

and this is where we need the assumption that $f$ is $(1/\epsilon)$-smooth. $\qquad\qquad\qquad\qquad\qquad\quad\square$

Note: Recall from above that the step size $\epsilon$ and time step $\delta$ are related via $\epsilon = \delta$, and continuous time $t$ and discrete time $k$ are related via $t = \delta k$, so $t = \epsilon k$. Thus, the $O(1/\epsilon k)$ convergence rate in discrete time really matches the $O(1/t)$ convergence rate in continuous time, albeit with an additional smoothness assumption in discrete time.

Note also that the $(1/\epsilon)$-smoothness assumption vanishes in the $\epsilon \to 0$ limit (since $1/\epsilon \to \infty$ so the upper bound becomes vacuous), so in continuous time we indeed recover a matching convergence rate under no assumption other than convexity.

Matching linear convergence rates: Strongly convex case

We now consider the case when $f$ is strongly convex, which means it is lower bounded by a quadratic:

\[f(x) \ge f(y) + \langle \nabla f(y), x-y \rangle + \frac{\sigma}{2} \|x-y\|^2\]

for some $\sigma > 0$ (and without loss of generality we can set $\sigma = 1$). An equivalent characterization of strong convexity that will be convenient for us is the following (see (2.1.19) of Nesterov ‘04):

\[f(x) \le f(y) + \langle \nabla f(y), x-y \rangle + \frac{1}{2\sigma} \|\nabla f(x)-\nabla f(y)\|^2\]

In particular, when $y=x^\ast$ we have $\nabla f(x^\ast) = 0$, so we get an upper bound on the function residual in terms of the squared norm of the gradient:

\[f(x) - f(x^*) \le \frac{1}{2\sigma} \|\nabla f(x)\|^2\]

This implies the following linear convergence rate for gradient flow under strong convexity.

Theorem: If $f$ is $\sigma$-strongly convex, then gradient flow has convergence rate
\[f(X_t) - f(x^*) \le (f(X_0) - f(x^*)) \exp(-2\sigma t) = O(\exp(-2\sigma t))\]

Proof: Computing the time derivative of the residual $f(X_t) - f(x^\ast)$ and using the bound above, we get

\[\frac{d}{dt}\; (f(X_t) - f(x^*)) = -\|\nabla f(X_t)\|^2 \le -2\sigma (f(X_t) - f(x^*))\]

which implies the desired linear convergence rate after dividing and integrating both sides. $\qquad\quad\square$

In discrete time, we can also prove a linear rate for gradient descent under smoothness and strong convexity.

Theorem: If $f$ is $(1/\epsilon)$-smooth and $\sigma$-strongly convex, then gradient descent has convergence rate
\[f(x_k) - f(x^*) \le \frac{f(x_0)-f(x^*)}{\left(1+\kappa\right)^k} = O\left(\exp\left(-\kappa k\right)\right)\]
where $\kappa = \epsilon \sigma$ is the (inverse) condition number, which we assume is small.

Proof: We first use the $(1/\epsilon)$-smoothness of $f$ to show that

\[\langle \nabla f(x_{k+1}), x_k - x_{k+1} \rangle \ge \frac{\epsilon}{2} \|\nabla f(x_{k+1})\|^2\]

The left-hand side is bounded above by $f(x_k) - f(x_{k+1}) = \Delta_k - \Delta_{k+1}$ by the convexity of $f$. The right-hand side is bounded below by $\epsilon \sigma (f(x_{k+1}) - f(x^*)) = \kappa \Delta_{k+1}$ by the strong convexity of $f$. Thus,

\[\Delta_{k+1} - \Delta_k \le -\kappa \Delta_{k+1}\]

which means the residual is decreasing multiplicatively at each iteration:

\[\Delta_{k+1} \le \frac{\Delta_k}{1+\kappa}\]

The desired linear convergence rate then follows by unrolling the recursion above. $\qquad\qquad\qquad\;\square$

So in the strongly convex case we still have matching linear convergence rates in continuous and discrete time. However, note that the rate in continuous time is the strong convexity constant $\sigma$, whereas the rate in discrete time is the condition number $\kappa = \epsilon \sigma$, which also requires the smoothness assumption. But nevertheless, we have seen the nice matching parallel structure of optimization in continuous and discrete time, as embodied by gradient flow and gradient descent.

The origin: Part 2