## Bolthausen’s proof of Berry Esseen theorem

In this section we re-present Bolthausen’s original proof of Berry Esseen using Stein’s method and fill in some computational details. I hope to make the argument more transparent to people trying to learn Stein’s method and Berry-Esseen theorem. The letter ${c}$ will denote a constant that varies from line to line, but ${C}$ will be a fixed absolute constant.
The classical Berry-Esseen theorems gives bound between the normal distribution function ${\Phi}$ and the cumulative distribution function of ${S_n = \sum_{i=1}^n \frac{X_i}{\sqrt{n}}}$, where ${X_i}$ are iid with mean 0 variance 1 and ${\mathbb{E} |X_i|^3 = \gamma}$. It states

$\displaystyle \sup_t | \Phi(t) - \mathbb{P}[S_n < t] | < \frac{C \gamma}{\sqrt{n}} \ \ \ \ \ (1)$

where ${C}$ is universal. Various efforts have been spent on finding the optimal ${C}$, which is certainly less than ${10}$. But we are only interested in the existence of such a ${C}$.
First we can restate (1)as follows. Define

$\displaystyle \delta(\gamma,n) = \sup\{|\mathbb{E}(h_z(S_n)) - \Phi(h_z)|: z \in {\mathbb R}, (X_1, \ldots, X_n) \in \mathcal{L}(n,\gamma) \} \ \ \ \ \ (2)$

where

1. ${\mathcal{L}(n,\gamma)}$ is the collection of all length ${n}$ sequences of iid variables with prescribed first three absolute moments, ${0}$,${1}$,and ${\gamma}$ respectively.
2. ${h_z(x) = 1_{(-\infty, z]}(x)}$ is the indicator that ${x \le z}$, and
3. ${\Phi(h_z) = \int_{-\infty}^\infty h_z(x) \phi(x) dx}$, where ${\phi(x)}$ is the standard normal density function, is the same as ${\Phi(z)}$.

Then Berry-Esseen says

$\displaystyle \sup\{\sqrt{n} \delta(\gamma,n) / \gamma: \gamma \ge 1, n \in {\mathbb N}\} < \infty. \ \ \ \ \ (3)$

As a check of your understanding, the above supremum is bounded by ${C}$ in (1).
Next we want to get a recursive bound on ${\delta(\gamma,n), \delta(\gamma,n-1)}$ etc. This will be achieved by manipulating the additive structure ${S_n = S_{n-1} \frac{\sqrt{n-1}}{\sqrt{n}} + \frac{X_n}{\sqrt{n}}}$. In fact if we can prove

$\displaystyle \delta(\gamma,n) \le c \gamma / \sqrt{n} + \delta(\gamma,n-1)/2, \ \ \ \ \ (4)$

for some absolute constant ${c}$, then we have

$\displaystyle \delta(\gamma,n) \le c\gamma/\sqrt{n} + (c\gamma/ \sqrt{n-1} + \delta(\gamma,n-2)/2)/2 \\ \le c\gamma /\sqrt{n} + c \gamma /2\sqrt{n-1} + c\gamma /4\sqrt{n-2} + \ldots \\ \le c \gamma/\sqrt{n} \ \ \ \ \ (5)$

where we can choose a bigger $c$ on the right hand side, because the geometric sum can be bounded by ${c_2/\sqrt{n}}$, for some $c_2 > c$ (the geometric decay dominates the ${k \mapsto \sqrt{n/(n-k)}}$ growth). To bound ${\delta(\gamma,n)}$, we will introduce some continuous approximations of the heavyside function ${h_z}$. We define

$\displaystyle h_{z,\lambda}(x) = [(1 + (z-x)/\lambda) \wedge 1] \vee 0 \ \ \ \ \ (6)$

where ${\wedge}$ means taking minimum of the two sides, and ${\vee}$ maximum. In words, ${h_{z,\lambda}}$ is the same as ${h_z}$ for ${x \notin [z,z+\lambda]}$ and linearly interpolates ${h(z)}$ and ${h(z+\lambda)}$ in between. We also denote ${h_{z,0} = h_z}$, as the natural functional limit.
We also define the direct analogue of ${\delta(\gamma,n)}$ in the ${\lambda}$-smoothed case,

$\displaystyle \delta(\lambda,\gamma,n) = \sup\{|\mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,\lambda})|: z \in {\mathbb R}, (X_1, X_2, \ldots) \in \mathcal{L}(n,\gamma)\}. \ \ \ \ \ (7)$

From the fact that ${h_{z,0} \le h_{z,\lambda} \le h_{z+\lambda,0}}$ (try drawing their graphs using the linear interpolation interpretation), one gets

$\displaystyle \delta(\gamma,n) \le \delta(\lambda,\gamma,n) + \lambda/\sqrt{2\pi}. \ \ \ \ \ (8)$

This follows because if for a particular sequence ${(X_1,X_2, \ldots) \in \mathcal{L}(n,\gamma)}$ and ${z \in {\mathbb R}}$, ${\mathbb{E}(h_{z,0}(S_n)) \ge \Phi(h_{z,0})}$, we get

$\displaystyle \mathbb{E}(h_{z,0}(S_n)) - \Phi(h_{z,0}) \le \mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,0})\\ \le \mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,\lambda}) + \Phi(h_{z,\lambda}) - \Phi(h_{z,0})\\ \le \delta(\lambda,\gamma,n) + \lambda/\sqrt{2\pi} \ \ \ \ \ (9)$

The last bound uses the fact that the normal density ${\phi(x)}$ is bounded uniformly by ${\frac{1}{\sqrt{2\pi}}}$, and that ${h_{z,\lambda} - h_z}$ is supported on the interval ${[z,z+\lambda]}$ and bounded by ${1}$. The case ${\mathbb{E}(h_{z,0}(S_n)) \le \Phi(h_{z,0})}$ is treated similarly. Hence taking the supremum over ${\mathcal{L}(n,\gamma)}$ we get (8).
Now we can make further abstraction of (3) by defining ${g_{z,\lambda}(x) = h_{z,\lambda}(x) - \Phi(h_{z,\lambda})}$, then we can rewrite

$\displaystyle \delta(\lambda,\gamma,n) = \sup\{|\mathbb{E} g_{z,\lambda}(S_n)|: z \in {\mathbb R}, (X_1,\ldots, X_n) \in \mathcal{L}(n,\gamma)\}. \ \ \ \ \ (10)$

This might seem tautological so far, but here is where Stein’s method comes in, because ${g}$ has the following nice represetation:

$\displaystyle h_{z,\lambda}(x) - \Phi(h_{z,\lambda}) =: g_{z,\lambda}(x) = f'(x) - xf(x) \ \ \ \ \ (11)$

where ${f(x) = e^{x^2/2} \int_{-\infty}^x (h(z) - \Phi(h)) e^{-z^2/2} dz}$. This is just elementary differentiation. Recall Stein’s method says that ${X}$ has standard normal distribution if and only if for every smooth ${f}$,

$\displaystyle \mathbb{E} f'(X) - X f(X) = 0 \ \ \ \ \ (12)$

Thus we are reducing the problem of bounding difference of distribution functions to that of bounding the associated Stein operator. From now on we will write ${h}$ for ${h_{z,\lambda}}$, and when ${\lambda = 0}$, we will write out ${h_z}$ explicitly.
Since ${|h(y)- \Phi(h)| \le 1}$,

$\displaystyle f(x) \le \phi(x)^{-1} \Phi(x). \ \ \ \ \ (13)$

By write ${\Phi(x) = \int_{-\infty}^x \frac{y}{y} e^{-y^2/2} / \sqrt{2\pi} dy}$, we get

$\displaystyle \Phi(x) \le \frac{1}{-x} \phi(x). \ \ \ \ \ (14)$

Thus for ${x \le -1}$, ${\Phi(x) \le \phi(x)}$. For ${x \in (-1,0]}$, we have ${\Phi(x) \le \Phi(0) = 1/2}$, and ${\phi(x) \ge \phi(-1) = \frac{1}{\sqrt{2\pi}} e^{-1/2}}$. Therefore

$\displaystyle \Phi(x) \le (1/2) \sqrt{2\pi} e^{1/2} \phi(x), \ \ \ \ \ (15)$

for ${x \in (-1,0]}$. Together we have (15) for ${x \le 0}$. Therefore ${|f(x)| \le (1/2) \sqrt{2\pi} e^{1/2}}$ (Bolthausen claims ${|f(x)| \le 1}$, but it doesn’t seem obvious and we don’t need it). Similarly, using (14), we get

$\displaystyle |xf(x)| \le 1, \ \ \ \ \ (16)$

which combined with (11) gives ${|f'(x)| \le 2}$. By symmetry, the above inequalities hold for ${x > 0}$ as well. So in summary we have ${|f(x)|}$, ${|xf(x)|}$, and ${|f'(x)|}$ are all universally bounded.
Also from the definition ${h = h_{z,\lambda}}$, we get by a simple change of variable

$\displaystyle h(x+y) - h(x) = \frac{1}{\lambda} |[x,x+y] \cap [z,z+\lambda]| \\ = \frac{1}{\lambda} |\int_x^{x+y} 1_{[z,z+\lambda]} (u) du|\\ = \frac{|y|}{\lambda} \int_0^1 1_{[z,z+\lambda]}(x+sy) ds \ \ \ \ \ (17)$

Note ${[x,x+y]}$ could stand for ${[x+y,x]}$ if ${y < 0}$.
Using Stein’s equation again we get

$\displaystyle |f'(x+y) - f'(x)| = |(y+x)f(y+x) +h(y+x) - xf(x) - h(x)|\\ =|yf(x+y) + x(f(x+y) - f(x)) + h(x+y) - h(x)|\\ \le |y|(1 + 2|x| + \frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]}(x+sy) ds). \ \ \ \ \ (18)$

Now recall we need to bound ${\mathbb{E} g(S_n) = \mathbb{E} h(S_n) - \Phi(h)}$. By Stein’s equation, this reduces to

$\displaystyle \mathbb{E}(f'(S_n) - S_n f(S_n)) = \mathbb{E}(f'(S_n) - \sum_{i=1}^n \frac{X_i}{\sqrt{n}} f(S_n)) \\ = \mathbb{E}(f'(S_n) - \sqrt{n} X_n f(S_n) ) \ \ \ \ \ (19)$

by linearity of expectation and exchangeability of ${X_i}$‘s; the form of ${X_n f(S_n)}$ is chosen merely as a convenience. The same argument below works for non-iid ${X_i}$ sequence as well.
We can write ${\mathbb{E}\sqrt{n} X_n f(S_n)}$ as

$\displaystyle \mathbb{E} X_n f(S_n) - X_n f(S'_{n-1}) = \mathbb{E} X_n \int_0^{X_n/\sqrt{n}} f(S'_{n-1} +u) du \\ = \mathbb{E} X_n^2 \int_0^1 f(S'_{n-1} + \frac{tX_n}{\sqrt{n}}) dt. \ \ \ \ \ (20)$

where ${S'_{n-1} = S_n - \frac{X_n}{\sqrt{n}}}$ (note the difference from ${S_{n-1}}$). Therefore (19)becomes

$\displaystyle \mathbb{E}[f'(S_n) - f'(S'_{n-1}) - X_n^2 \int_0^1[f'(S'_{n-1} + \frac{t X_n}{\sqrt{n}}) - f'(S'_{n-1})] dt]. \ \ \ \ \ (21)$

The next comes the crucial steps in establishing the recursive bound (4). We will bound ${\mathbb{E}[f'(S_n) - f'(S'_{n-1})]}$ and the later integral separately:

1. Using (18),

$\displaystyle \mathbb{E} | f'(S_n) - f'(S'_{n-1})| \le \mathbb{E} [ \frac{|X_n|}{\sqrt{n}} ( 1 + 2 |S'_{n-1}| + \frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]} ( S'_{n-1} + \frac{tX_n}{\sqrt{n}})dt)]. \ \ \ \ \ (22)$

We use two comparisons with the Gaussian distribution to get, for any constant ${X_n}$,

$\displaystyle \mathbb{E}[ 1_{[z,z+\lambda]}(S'_{n-1} + \frac{tX_n}{\sqrt{n}})] = \mathbb{P} [ z - \frac{tX_n}{\sqrt{n}} \le S'_{n-1} \le z+ \lambda - \frac{t X_n}{\sqrt{n}}] \\ = \mathbb{P}[z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{t X_n}{\sqrt{n-1}} \le S_{n-1} \le z \frac{\sqrt{n}}{\sqrt{n-1}} + \lambda \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{t X_n}{\sqrt{n-1}}]\\ \le \mathbb{E} h_{(z+\lambda) \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1}) - h_{z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1}) \\ \le \Phi(h_{(z+\lambda) \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}}) + \delta(\gamma,n-1) - \Phi( h_{z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1})) + \delta(\gamma,n-1)\\ \le \lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1). \ \ \ \ \ (23)$

Here we again use the uniform bound of the normal density. Observe that the final bound is in terms of ${\delta(\gamma,n-1)}$ instead of ${\delta(\lambda,\gamma,n-1)}$. This is due to (17)and nicely relates ${\delta(\lambda,\gamma,n)}$‘s to ${\delta(\gamma,n)}$‘s. Also note that

$\displaystyle \mathbb{E} |X_n| |S'_{n-1}| \le \mathbb{E} X_n^2 \mathbb{E}[(S'_{n-1})^2] \\ \le \frac{n}{n-1} \le 2 \ \ \ \ \ (24)$

for ${n \ge 2}$.
Thus by conditioning first on ${X_n}$, we get

$\displaystyle \mathbb{E}|f'(S_n) - f'(S'_{n-1})| \le \frac{\mathbb{E} X_n^2 }{\sqrt{n}} + \frac{2}{\sqrt{n}} +\\ \frac{1}{\lambda} \int_0^1 [\lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)] dt\\ =\frac{\mathbb{E} X_n^2 }{\sqrt{n}} + \frac{2}{\sqrt{n}} + \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)/\lambda \\ \le \frac{c}{\sqrt{n}} (1 + \delta(\gamma, n-1)/ \lambda). \ \ \ \ \ (25)$

2. Now we bound the second term in (19) similarly, using (18)again:

$\displaystyle \mathbb{E} | X_n^2 \int_0^1 [f'(S_{n-1} + \frac{t X_n}{\sqrt{n}}) - f'(S_{n-1})] dt| \le \mathbb{E}[ |\frac{X_n^3}{\sqrt{n}}| (1 + 2|S_{n-1}| )]\\ + \mathbb{E} [\frac{X^2}{\sqrt{n}}\frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]}(S'_{n-1}+s\frac{X_n}{\sqrt{n}}) ds]\\ \le \frac{c \gamma}{\sqrt{n}} + \frac{1}{\sqrt{n}} \mathbb{E} [X_n^2]\frac{1}{\lambda} \int_0^1 [\lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)] dt \\ \le \frac{c \gamma}{\sqrt{n}}(1 + \delta(\gamma,n-1)/\lambda). \ \ \ \ \ (26)$

This is the only place where ${\mathbb{E}|X_n|^3 = \gamma}$ is used.

So in sum, we have proved that

$\displaystyle \mathbb{E}(|f'(S_n) - S_n f(S_n)|) \le \frac{c}{\sqrt{n}}(1+ \delta(\gamma,n-1)/\lambda) + \frac{c\gamma}{\sqrt{n}} (1 + \delta(\gamma,n-1)/\lambda) \\ \le \frac{c\gamma}{\sqrt{n}} (1+ \delta(\gamma,n-1)/\lambda), \ \ \ \ \ (27)$

because ${\gamma \ge 1}$ by Holder’s inequality. Recalling (10), and the relation between ${f}$ and ${h}$, we can take the supremum over all ${h_{z,\lambda}}$ for a fixed ${\lambda}$, and ${(X_1, \ldots, X_n) \in \mathcal{L}(n,\gamma)}$, to get

$\displaystyle \delta(\lambda,\gamma,n) = \sup_{z,(X_1, \ldots, X_n)} \mathbb{E}|f'(S_n) - S_n f(S_n)| \\ \le \frac{c\gamma}{\sqrt{n}} (1+ \delta(\gamma,n-1)/\lambda). \ \ \ \ \ (28)$

Finally using the bound (8)between ${\delta(\gamma,n)}$ and ${\delta(\lambda,\gamma,n)}$, we can relate ${\delta(\gamma,n)}$ with ${\delta(\gamma,n-1)}$:

$\displaystyle \delta(\gamma,n) \le \frac{c \gamma}{\sqrt{n}}(1 + \delta(\gamma,n-1)/\lambda) + \lambda/\sqrt{2\pi}. \ \ \ \ \ (29)$

It remains then to optimize over ${\lambda}$. Choosing ${\lambda = \frac{2c\gamma}{\sqrt{n}}}$, we arrive at (4), which concludes the proof of Berry-Esseen’s theorem.

Consider first the sequence $1 + 1/2 + 1/4 + 1/8 + ..$ This converges to 2. Now consider $\frac{1}{\sqrt{n}} + \frac{1}{2 \sqrt{n-1}} + \frac{1}{4 \sqrt{n-2}} + \frac{1}{8 \sqrt{n-3}} + \ldots$. I can rewrite the latter as $\frac{1}{\sqrt{n}} [ 1 + \frac{1}{2} \sqrt{n/(n-1)} + \frac{1}{4} \sqrt{n/(n-2)} + \ldots]$. The geometric series dominates the growth of $\sqrt{n/(n-k)}$, for instance, you can write $\sqrt{n/(n-k)} \le n/(n-k) \le 1 + k$. So we know that the latter series is summable because $\int_0^\infty 2^{-x} (1+x) dx < \infty$. Thus for any $c$ on the left hand side of eq (5) I can choose a bigger $c$ on the right hand side.