## Markov chain mixing time bound — Doeblin technique

One of the earliest results on total variation mixing time analysis for the Kac random walk (in fact the only one before the introduction of geometric techniques) was obtained via the Doeblin condition in the original paper of Diaconis and Saloff-Coste. Here I want to give an elementary, stand-alone account of the argument, applicable to all ergodic Markov chains.

Recall an ergodic Markov chain ${(K,\pi)}$ is an operator (finite or infinite dimensional) together with a probability distribution ${\pi}$ on some state space ${S}$, with the property that for any starting distribution ${\mu}$ on ${S}$, ${\| \mu K^t - \pi\|_{\text{TV}}}$ (to be defined below) converges to ${0}$ as ${t}$ tends to infinity. Here for simplicity we take ${t}$ to be discrete, integer-valued. ${\pi}$ thus coincides with the so-called stationary distribution of ${K}$, which satisfies ${\pi K = \pi}$. If the chain is not ergodic, i.e., it is periodic or reducible (or, in the infinite-${S}$ case, not Harris recurrent), then convergence can fail and the stationary distribution need not be unique.

Recall one important motivation for studying ergodic Markov chains is that they provide a way to approximately sample from their stationary distribution, which is typically hard to sample from exactly. For instance, the ferromagnetic Ising model on a lattice or graph with ${n}$ vertices has ${2^n}$ states. Its partition function (the sum of the unnormalized probability weights of all the states) is thus a sum of exponentially many terms, a computational challenge.
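To make the exponential cost concrete, here is a minimal brute-force computation of the Ising partition function on a tiny hypothetical graph (a 4-cycle with an inverse temperature I picked arbitrarily; none of these choices come from the text above):

```python
from itertools import product
from math import exp

# Brute-force partition function of a ferromagnetic Ising model:
# Z = sum over all 2^n spin configurations s in {-1, +1}^n of
# exp(beta * sum_{(i,j) in E} s_i * s_j).
# The loop visits every configuration, which is exactly the
# exponential-in-n cost mentioned above.
def partition_function(n, edges, beta):
    Z = 0.0
    for spins in product([-1, 1], repeat=n):  # 2^n configurations
        energy = sum(spins[i] * spins[j] for i, j in edges)
        Z += exp(beta * energy)
    return Z

# hypothetical example: a 4-cycle with beta = 0.5
Z = partition_function(4, [(0, 1), (1, 2), (2, 3), (3, 0)], beta=0.5)
```

With no edges each configuration contributes ${e^0 = 1}$, so ${Z = 2^n}$ exactly, which is a handy sanity check on the enumeration.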

Given a suitable Markov chain model for sampling, one is naturally interested in understanding the rate of this convergence. Another application comes from natural randomization processes such as card-shuffling models, where the effectiveness of the process needs to be examined for reliable use.

Now the rate of convergence can be measured in many different ways. An analogy is how we measure goodness of fit in statistical estimation: the choice of mean squared error, for instance, can be replaced by the ${L^1}$ error, the ${L^\infty}$ error, etc. Arguably the most natural measure for Markov chain convergence is the total variation distance, defined for two probability measures ${\mu}$ and ${\nu}$ by ${\|\mu - \nu\|_{\text{TV}} = \sup_{A \subset S\text{ Borel}} |\mu(A) - \nu(A)|}$. This is natural because the definition is entirely in terms of the probabilities the two measures assign to events in the state space. Furthermore the distance is always between ${0}$ and ${1}$, avoiding the issue of comparing ${\infty}$ against ${\infty}$ that arises with the ${L^p}$ distance for ${p > 1}$.
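On a finite state space the supremum over events can be computed by brute force, and it coincides with half the ${L^1}$ distance. A small sketch (the two distributions below are arbitrary illustrative choices):

```python
from itertools import combinations

# Total variation distance between two probability vectors on a finite
# state space: sup over events A of |mu(A) - nu(A)|, computed here by
# enumerating all 2^n subsets A.
def tv_sup(mu, nu):
    n = len(mu)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            best = max(best, abs(sum(mu[i] for i in A)
                                 - sum(nu[i] for i in A)))
    return best

# On a finite space the supremum equals half the L1 distance.
def tv_half_l1(mu, nu):
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
```

The agreement of the two formulas is a standard fact: the optimal event ${A}$ is the set of states where ${\mu}$ exceeds ${\nu}$.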

One of the earliest techniques to estimate the total variation distance at a given time step ${T}$ (or equivalently, given a target distance, the smallest ${T}$ one needs) was developed by Doeblin in the 1930s. Surprisingly it is still widely used in many cutting-edge problems, for which a deeper analysis is either impossible or would be prohibitively laborious.

The technique states the following:

Suppose for some ${t > 0}$ and ${\epsilon > 0}$, every starting distribution ${\mu}$ satisfies ${\mu K^t(A) \ge \epsilon\, \pi(A)}$ for all Borel ${A \subset S}$; then ${\|\mu K^{st} - \pi\|_{\text{TV}} \le (1 - \epsilon)^s}$ for every positive integer ${s}$. Thus to apply it, one needs to find a sufficiently large unit time ${t}$ that results in a positive ${\epsilon}$ (clearly if ${\epsilon = 0}$ the technique is useless).
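Here is a minimal numerical check of the statement on a hypothetical two-state chain (the matrix ${K}$ below is my own toy choice). For a finite chain the condition reduces to the entrywise minorization ${K^t(x, y) \ge \epsilon\, \pi(y)}$, so one can take ${\epsilon = \min_{x,y} K^t(x, y)/\pi(y)}$:

```python
# Toy verification of the Doeblin bound on a 2-state chain.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(A, p):
    R = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(p):
        R = mat_mul(R, A)
    return R

def tv(mu, nu):
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))

K = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]                     # stationary: pi K = pi

t = 1
Kt = mat_pow(K, t)
# entrywise minorization constant eps = min K^t(x,y) / pi(y)
eps = min(Kt[x][y] / pi[y] for x in range(2) for y in range(2))

mu = [1.0, 0.0]                     # worst-case point-mass start
for s in range(1, 6):
    Kst = mat_pow(K, s * t)
    dist = [sum(mu[x] * Kst[x][y] for x in range(2)) for y in range(2)]
    # the Doeblin bound: TV at time s*t is at most (1 - eps)^s
    assert tv(dist, pi) <= (1 - eps) ** s + 1e-12
```

For this particular ${K}$ one finds ${\epsilon = 0.3}$, so the bound decays like ${0.7^s}$; the assertion confirms the actual total variation distance stays below it.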

Let me explain why the technique works. Given such a triple ${(K,t,\epsilon)}$, I can write ${K^t = \epsilon \hat{\pi} + R}$, where ${\hat{\pi}}$ is an uninteresting Markov operator defined by ${\nu \hat{\pi} (f) = \pi(f)}$ for any probability measure ${\nu}$ on ${S}$. Thus ${\hat{\pi}}$ takes any starting point on ${S}$ and goes immediately to the stationary distribution in one step. ${R}$ is the left-over component of ${K^t}$. What's important is that ${R}$ is a positive operator, in the sense that ${\nu R 1_A \ge 0}$ for any probability measure ${\nu}$ and Borel set ${A \subset S}$; this positivity is exactly the content of the Doeblin condition above.

Furthermore notice that ${\hat{\pi} R = R \hat{\pi} = ( K^t - \epsilon \hat{\pi}) \hat{\pi} = (1 - \epsilon) \hat{\pi}}$ (for ${\hat{\pi} R}$ one also uses ${\pi K^t = \pi}$). Thus ${R}$ is a sub-Markov kernel with ${\pi}$ as a quasi-fixed point. This is enough algebraic felicity to yield the following:

By induction on ${s}$, the commutation relations above give ${(\epsilon \hat{\pi} + R)^s = R^s + (1 - (1 - \epsilon)^s) \hat{\pi}}$. Hence ${\mu K^{st}(A) - \pi(A) = \mu(\epsilon \hat{\pi} + R)^s(A) - \pi(A) = \mu R^s (A) + ( 1 - (1 - \epsilon)^s) \pi(A) - \pi(A) = \mu R^s(A) - ( 1 - \epsilon)^s \pi(A)}$. Now ${\mu R^s (A) \le \mu R^s(S) = (1 - \epsilon)^s}$, since each application of ${R}$ shrinks the total mass by a factor of ${1 - \epsilon}$. Therefore the last expression is ${\le (1 -\epsilon)^s (1 - \pi(A)) \le ( 1- \epsilon)^s}$. Since we do not know much about ${R}$ or ${A}$, there is not much room for further optimization in the last two inequalities.
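The decomposition and the mass-loss property of ${R}$ can also be checked numerically. The sketch below uses a hypothetical two-state chain of my own choosing (with ${t = 1}$), builds ${\hat{\pi}}$ as the matrix whose rows are all ${\pi}$, and confirms that ${R = K^t - \epsilon \hat{\pi}}$ is entrywise nonnegative with row sums ${(1-\epsilon)^s}$ after ${s}$ applications:

```python
# Sanity check of the algebra above on a toy 2-state chain.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

K = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]                     # stationary distribution of K
# Doeblin constant for t = 1 (entrywise minorization)
eps = min(K[x][y] / pi[y] for x in range(2) for y in range(2))

Pi_hat = [pi[:] for _ in range(2)]  # every row of pi_hat is pi
R = [[K[x][y] - eps * pi[y] for y in range(2)] for x in range(2)]

# R is entrywise nonnegative: this is the positivity of the left-over kernel
assert all(r >= -1e-12 for row in R for r in row)

# each row of R sums to 1 - eps, so row sums of R^s are (1 - eps)^s;
# this is exactly the bound mu R^s(A) <= (1 - eps)^s used above
Rs = R
for s in range(1, 5):
    for row in Rs:
        assert abs(sum(row) - (1 - eps) ** s) < 1e-12
    Rs = mat_mul(Rs, R)
```

The row-sum check is the finite-dimensional shadow of the statement that ${R}$ is a sub-Markov kernel losing an ${\epsilon}$ fraction of mass per step.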

Doeblin’s technique typically gives a very poor upper bound on the mixing time (equivalently, on the total variation distance at a given time step). For discrete Markov chains, the bound obtained is usually exponential in the size of the natural parameter (for the Ising model that is the number of vertices in the underlying graph, not the number of states). However, for continuous or countable state space chains, where upper bounds are simply non-existent, the technique can provide initial ice-breaking. A case in point is the Kac model, where virtually all common techniques fail. However, when one analyzes the so-called Givens rotation carefully and understands how a sequence of such rotations leads to any particular orthogonal matrix, the minorization condition of the technique can be fulfilled, albeit at a very large ${t}$ and small ${\epsilon}$.