Bolthausen’s proof of Berry Esseen theorem

In this section we re-present Bolthausen’s original proof of Berry Esseen using Stein’s method and fill in some computational details. I hope to make the argument more transparent to people trying to learn Stein’s method and Berry-Esseen theorem. The letter {c} will denote a constant that varies from line to line, but {C} will be a fixed absolute constant.
The classical Berry-Esseen theorems gives bound between the normal distribution function {\Phi} and the cumulative distribution function of {S_n = \sum_{i=1}^n \frac{X_i}{\sqrt{n}}}, where {X_i} are iid with mean 0 variance 1 and {\mathbb{E} |X_i|^3 = \gamma}. It states

\displaystyle \sup_t | \Phi(t) - \mathbb{P}[S_n < t] | < \frac{C \gamma}{\sqrt{n}} \ \ \ \ \ (1)

where {C} is universal. Various efforts have been spent on finding the optimal {C}, which is certainly less than {10}. But we are only interested in the existence of such a {C}.
First we can restate (1)as follows. Define

\displaystyle \delta(\gamma,n) = \sup\{|\mathbb{E}(h_z(S_n)) - \Phi(h_z)|: z \in {\mathbb R}, (X_1, \ldots, X_n) \in \mathcal{L}(n,\gamma) \} \ \ \ \ \ (2)

where

  1. {\mathcal{L}(n,\gamma)} is the collection of all length {n} sequences of iid variables with prescribed first three absolute moments, {0},{1},and {\gamma} respectively.
  2. {h_z(x) = 1_{(-\infty, z]}(x)} is the indicator that {x \le z}, and
  3. {\Phi(h_z) = \int_{-\infty}^\infty h_z(x) \phi(x) dx}, where {\phi(x)} is the standard normal density function, is the same as {\Phi(z)}.

Then Berry-Esseen says

\displaystyle \sup\{\sqrt{n} \delta(\gamma,n) / \gamma: \gamma \ge 1, n \in {\mathbb N}\} < \infty. \ \ \ \ \ (3)

As a check of your understanding, the above supremum is bounded by {C} in (1).
Next we want to get a recursive bound on {\delta(\gamma,n), \delta(\gamma,n-1)} etc. This will be achieved by manipulating the additive structure {S_n = S_{n-1} \frac{\sqrt{n-1}}{\sqrt{n}} + \frac{X_n}{\sqrt{n}}}. In fact if we can prove

\displaystyle \delta(\gamma,n) \le c \gamma / \sqrt{n} + \delta(\gamma,n-1)/2, \ \ \ \ \ (4)

for some absolute constant {c}, then we have

\displaystyle \delta(\gamma,n) \le c\gamma/\sqrt{n} + (c\gamma/ \sqrt{n-1} + \delta(\gamma,n-2)/2)/2 \\ \le c\gamma /\sqrt{n} + c \gamma /2\sqrt{n-1} + c\gamma /4\sqrt{n-2} + \ldots \\ \le c \gamma/\sqrt{n} \ \ \ \ \ (5)

where we can choose a bigger c on the right hand side, because the geometric sum can be bounded by {c_2/\sqrt{n}}, for some c_2 > c (the geometric decay dominates the {k \mapsto \sqrt{n/(n-k)}} growth). To bound {\delta(\gamma,n)}, we will introduce some continuous approximations of the heavyside function {h_z}. We define

\displaystyle h_{z,\lambda}(x) = [(1 + (z-x)/\lambda) \wedge 1] \vee 0 \ \ \ \ \ (6)

where {\wedge} means taking minimum of the two sides, and {\vee} maximum. In words, {h_{z,\lambda}} is the same as {h_z} for {x \notin [z,z+\lambda]} and linearly interpolates {h(z)} and {h(z+\lambda)} in between. We also denote {h_{z,0} = h_z}, as the natural functional limit.
We also define the direct analogue of {\delta(\gamma,n)} in the {\lambda}-smoothed case,

\displaystyle \delta(\lambda,\gamma,n) = \sup\{|\mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,\lambda})|: z \in {\mathbb R}, (X_1, X_2, \ldots) \in \mathcal{L}(n,\gamma)\}. \ \ \ \ \ (7)

From the fact that {h_{z,0} \le h_{z,\lambda} \le h_{z+\lambda,0}} (try drawing their graphs using the linear interpolation interpretation), one gets

\displaystyle \delta(\gamma,n) \le \delta(\lambda,\gamma,n) + \lambda/\sqrt{2\pi}. \ \ \ \ \ (8)

This follows because if for a particular sequence {(X_1,X_2, \ldots) \in \mathcal{L}(n,\gamma)} and {z \in {\mathbb R}}, {\mathbb{E}(h_{z,0}(S_n)) \ge \Phi(h_{z,0})}, we get

\displaystyle \mathbb{E}(h_{z,0}(S_n)) - \Phi(h_{z,0}) \le \mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,0})\\ \le \mathbb{E}(h_{z,\lambda}(S_n)) - \Phi(h_{z,\lambda}) + \Phi(h_{z,\lambda}) - \Phi(h_{z,0})\\ \le \delta(\lambda,\gamma,n) + \lambda/\sqrt{2\pi} \ \ \ \ \ (9)

The last bound uses the fact that the normal density {\phi(x)} is bounded uniformly by {\frac{1}{\sqrt{2\pi}}}, and that {h_{z,\lambda} - h_z} is supported on the interval {[z,z+\lambda]} and bounded by {1}. The case {\mathbb{E}(h_{z,0}(S_n)) \le \Phi(h_{z,0})} is treated similarly. Hence taking the supremum over {\mathcal{L}(n,\gamma)} we get (8).
Now we can make further abstraction of (3) by defining {g_{z,\lambda}(x) = h_{z,\lambda}(x) - \Phi(h_{z,\lambda})}, then we can rewrite

\displaystyle \delta(\lambda,\gamma,n) = \sup\{|\mathbb{E} g_{z,\lambda}(S_n)|: z \in {\mathbb R}, (X_1,\ldots, X_n) \in \mathcal{L}(n,\gamma)\}. \ \ \ \ \ (10)

This might seem tautological so far, but here is where Stein’s method comes in, because {g} has the following nice represetation:

\displaystyle h_{z,\lambda}(x) - \Phi(h_{z,\lambda}) =: g_{z,\lambda}(x) = f'(x) - xf(x) \ \ \ \ \ (11)

where {f(x) = e^{x^2/2} \int_{-\infty}^x (h(z) - \Phi(h)) e^{-z^2/2} dz}. This is just elementary differentiation. Recall Stein’s method says that {X} has standard normal distribution if and only if for every smooth {f},

\displaystyle \mathbb{E} f'(X) - X f(X) = 0 \ \ \ \ \ (12)

Thus we are reducing the problem of bounding difference of distribution functions to that of bounding the associated Stein operator. From now on we will write {h} for {h_{z,\lambda}}, and when {\lambda = 0}, we will write out {h_z} explicitly.
Since {|h(y)- \Phi(h)| \le 1},

\displaystyle f(x) \le \phi(x)^{-1} \Phi(x). \ \ \ \ \ (13)

By write {\Phi(x) = \int_{-\infty}^x \frac{y}{y} e^{-y^2/2} / \sqrt{2\pi} dy}, we get

\displaystyle \Phi(x) \le \frac{1}{-x} \phi(x). \ \ \ \ \ (14)

Thus for {x \le -1}, {\Phi(x) \le \phi(x)}. For {x \in (-1,0]}, we have {\Phi(x) \le \Phi(0) = 1/2}, and {\phi(x) \ge \phi(-1) = \frac{1}{\sqrt{2\pi}} e^{-1/2}}. Therefore

\displaystyle \Phi(x) \le (1/2) \sqrt{2\pi} e^{1/2} \phi(x), \ \ \ \ \ (15)

for {x \in (-1,0]}. Together we have (15) for {x \le 0}. Therefore {|f(x)| \le (1/2) \sqrt{2\pi} e^{1/2}} (Bolthausen claims {|f(x)| \le 1}, but it doesn’t seem obvious and we don’t need it). Similarly, using (14), we get

\displaystyle |xf(x)| \le 1, \ \ \ \ \ (16)

which combined with (11) gives {|f'(x)| \le 2}. By symmetry, the above inequalities hold for {x > 0} as well. So in summary we have {|f(x)|}, {|xf(x)|}, and {|f'(x)|} are all universally bounded.
Also from the definition {h = h_{z,\lambda}}, we get by a simple change of variable

\displaystyle h(x+y) - h(x) = \frac{1}{\lambda} |[x,x+y] \cap [z,z+\lambda]| \\ = \frac{1}{\lambda} |\int_x^{x+y} 1_{[z,z+\lambda]} (u) du|\\ = \frac{|y|}{\lambda} \int_0^1 1_{[z,z+\lambda]}(x+sy) ds \ \ \ \ \ (17)

Note {[x,x+y]} could stand for {[x+y,x]} if {y < 0}.
Using Stein’s equation again we get

\displaystyle |f'(x+y) - f'(x)| = |(y+x)f(y+x) +h(y+x) - xf(x) - h(x)|\\ =|yf(x+y) + x(f(x+y) - f(x)) + h(x+y) - h(x)|\\ \le |y|(1 + 2|x| + \frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]}(x+sy) ds). \ \ \ \ \ (18)

Now recall we need to bound {\mathbb{E} g(S_n) = \mathbb{E} h(S_n) - \Phi(h)}. By Stein’s equation, this reduces to

\displaystyle \mathbb{E}(f'(S_n) - S_n f(S_n)) = \mathbb{E}(f'(S_n) - \sum_{i=1}^n \frac{X_i}{\sqrt{n}} f(S_n)) \\ = \mathbb{E}(f'(S_n) - \sqrt{n} X_n f(S_n) ) \ \ \ \ \ (19)

by linearity of expectation and exchangeability of {X_i}‘s; the form of {X_n f(S_n)} is chosen merely as a convenience. The same argument below works for non-iid {X_i} sequence as well.
We can write {\mathbb{E}\sqrt{n} X_n f(S_n)} as

\displaystyle \mathbb{E} X_n f(S_n) - X_n f(S'_{n-1}) = \mathbb{E} X_n \int_0^{X_n/\sqrt{n}} f(S'_{n-1} +u) du \\ = \mathbb{E} X_n^2 \int_0^1 f(S'_{n-1} + \frac{tX_n}{\sqrt{n}}) dt. \ \ \ \ \ (20)

where {S'_{n-1} = S_n - \frac{X_n}{\sqrt{n}}} (note the difference from {S_{n-1}}). Therefore (19)becomes

\displaystyle \mathbb{E}[f'(S_n) - f'(S'_{n-1}) - X_n^2 \int_0^1[f'(S'_{n-1} + \frac{t X_n}{\sqrt{n}}) - f'(S'_{n-1})] dt]. \ \ \ \ \ (21)

The next comes the crucial steps in establishing the recursive bound (4). We will bound {\mathbb{E}[f'(S_n) - f'(S'_{n-1})]} and the later integral separately:

  1. Using (18),

    \displaystyle \mathbb{E} | f'(S_n) - f'(S'_{n-1})| \le \mathbb{E} [ \frac{|X_n|}{\sqrt{n}} ( 1 + 2 |S'_{n-1}| + \frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]} ( S'_{n-1} + \frac{tX_n}{\sqrt{n}})dt)]. \ \ \ \ \ (22)

    We use two comparisons with the Gaussian distribution to get, for any constant {X_n},

    \displaystyle \mathbb{E}[ 1_{[z,z+\lambda]}(S'_{n-1} + \frac{tX_n}{\sqrt{n}})] = \mathbb{P} [ z - \frac{tX_n}{\sqrt{n}} \le S'_{n-1} \le z+ \lambda - \frac{t X_n}{\sqrt{n}}] \\ = \mathbb{P}[z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{t X_n}{\sqrt{n-1}} \le S_{n-1} \le z \frac{\sqrt{n}}{\sqrt{n-1}} + \lambda \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{t X_n}{\sqrt{n-1}}]\\ \le \mathbb{E} h_{(z+\lambda) \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1}) - h_{z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1}) \\ \le \Phi(h_{(z+\lambda) \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}}) + \delta(\gamma,n-1) - \Phi( h_{z \frac{\sqrt{n}}{\sqrt{n-1}} - \frac{tX_n}{\sqrt{n-1}}} (S_{n-1})) + \delta(\gamma,n-1)\\ \le \lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1). \ \ \ \ \ (23)

    Here we again use the uniform bound of the normal density. Observe that the final bound is in terms of {\delta(\gamma,n-1)} instead of {\delta(\lambda,\gamma,n-1)}. This is due to (17)and nicely relates {\delta(\lambda,\gamma,n)}‘s to {\delta(\gamma,n)}‘s. Also note that

    \displaystyle \mathbb{E} |X_n| |S'_{n-1}| \le \mathbb{E} X_n^2 \mathbb{E}[(S'_{n-1})^2] \\ \le \frac{n}{n-1} \le 2 \ \ \ \ \ (24)

    for {n \ge 2}.
    Thus by conditioning first on {X_n}, we get

    \displaystyle \mathbb{E}|f'(S_n) - f'(S'_{n-1})| \le \frac{\mathbb{E} X_n^2 }{\sqrt{n}} + \frac{2}{\sqrt{n}} +\\ \frac{1}{\lambda} \int_0^1 [\lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)] dt\\ =\frac{\mathbb{E} X_n^2 }{\sqrt{n}} + \frac{2}{\sqrt{n}} + \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)/\lambda \\ \le \frac{c}{\sqrt{n}} (1 + \delta(\gamma, n-1)/ \lambda). \ \ \ \ \ (25)

  2. Now we bound the second term in (19) similarly, using (18)again:

    \displaystyle \mathbb{E} | X_n^2 \int_0^1 [f'(S_{n-1} + \frac{t X_n}{\sqrt{n}}) - f'(S_{n-1})] dt| \le \mathbb{E}[ |\frac{X_n^3}{\sqrt{n}}| (1 + 2|S_{n-1}| )]\\ + \mathbb{E} [\frac{X^2}{\sqrt{n}}\frac{1}{\lambda} \int_0^1 1_{[z,z+\lambda]}(S'_{n-1}+s\frac{X_n}{\sqrt{n}}) ds]\\ \le \frac{c \gamma}{\sqrt{n}} + \frac{1}{\sqrt{n}} \mathbb{E} [X_n^2]\frac{1}{\lambda} \int_0^1 [\lambda \frac{\sqrt{n}}{\sqrt{n-1}} \frac{1}{\sqrt{2\pi}}+ 2\delta(\gamma,n-1)] dt \\ \le \frac{c \gamma}{\sqrt{n}}(1 + \delta(\gamma,n-1)/\lambda). \ \ \ \ \ (26)

    This is the only place where {\mathbb{E}|X_n|^3 = \gamma} is used.

So in sum, we have proved that

\displaystyle \mathbb{E}(|f'(S_n) - S_n f(S_n)|) \le \frac{c}{\sqrt{n}}(1+ \delta(\gamma,n-1)/\lambda) + \frac{c\gamma}{\sqrt{n}} (1 + \delta(\gamma,n-1)/\lambda) \\ \le \frac{c\gamma}{\sqrt{n}} (1+ \delta(\gamma,n-1)/\lambda), \ \ \ \ \ (27)

because {\gamma \ge 1} by Holder’s inequality. Recalling (10), and the relation between {f} and {h}, we can take the supremum over all {h_{z,\lambda}} for a fixed {\lambda}, and {(X_1, \ldots, X_n) \in \mathcal{L}(n,\gamma)}, to get

\displaystyle \delta(\lambda,\gamma,n) = \sup_{z,(X_1, \ldots, X_n)} \mathbb{E}|f'(S_n) - S_n f(S_n)| \\ \le \frac{c\gamma}{\sqrt{n}} (1+ \delta(\gamma,n-1)/\lambda). \ \ \ \ \ (28)

Finally using the bound (8)between {\delta(\gamma,n)} and {\delta(\lambda,\gamma,n)}, we can relate {\delta(\gamma,n)} with {\delta(\gamma,n-1)}:

\displaystyle \delta(\gamma,n) \le \frac{c \gamma}{\sqrt{n}}(1 + \delta(\gamma,n-1)/\lambda) + \lambda/\sqrt{2\pi}. \ \ \ \ \ (29)

It remains then to optimize over {\lambda}. Choosing {\lambda = \frac{2c\gamma}{\sqrt{n}}}, we arrive at (4), which concludes the proof of Berry-Esseen’s theorem.

Advertisements

About aquazorcarson

math PhD at Stanford, studying probability
This entry was posted in Uncategorized. Bookmark the permalink.

2 Responses to Bolthausen’s proof of Berry Esseen theorem

  1. Michael says:

    Though most of the steps are very detailed (thanks a lot for this, the proof in Bolthausens’s paper is quite hard to grasp at a first glance), I can’t see why eq (5) should be true. We need
    sum(2^(-k)/sqrt(n-k), k, 0, n-1) infty, but where does the bound on the right-hand side come from? When I computed the expression for several n, I found that the bound seems to be quite sharp. Can you give me a hint on why it is true?

    • Dear Michael,
      Consider first the sequence 1 + 1/2 + 1/4 + 1/8 + .. This converges to 2. Now consider \frac{1}{\sqrt{n}} + \frac{1}{2 \sqrt{n-1}} + \frac{1}{4 \sqrt{n-2}} + \frac{1}{8 \sqrt{n-3}} + \ldots . I can rewrite the latter as \frac{1}{\sqrt{n}} [ 1 + \frac{1}{2} \sqrt{n/(n-1)} + \frac{1}{4} \sqrt{n/(n-2)} + \ldots] . The geometric series dominates the growth of \sqrt{n/(n-k)}, for instance, you can write \sqrt{n/(n-k)} \le n/(n-k) \le 1 + k. So we know that the latter series is summable because \int_0^\infty 2^{-x} (1+x) dx < \infty. Thus for any c on the left hand side of eq (5) I can choose a bigger c on the right hand side.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s