Excursion into the maximum of random variables

My first serious dabbling with maxima of random variables did not occur until after graduate school. During a fruitful collaboration with a local college professor on an optimal scheduling problem, with the basic time unit being Gaussian, I had to ask myself the following basic question: what does the distribution of the maximum of two independent Gaussians look like? Strangely, this question never seriously entered my radar during my first five years of studying probability, aside from some theoretical toying with the Gumbel, Frechet, and Weibull distributions that arise in extreme value theory, that is, when you take the maximum of n iid random variables and rescale it suitably.

In many practical applications, however, extreme value theory seems useless, because we really care about what happens when n = 2 or 3. One would think that for something as well studied as Gaussians, the distribution of $X \vee Y$ for two independent Gaussians $X, Y$, not necessarily of the same mean and variance, would be well documented. In fact, even the first two moments can be challenging to calculate. The earliest, and in fact the most widely cited, reference for this problem is due to Charles E. Clark, The Greatest of a Finite Set of Random Variables, a condescending title in the same tradition as the old-school statistics masters.

At the time of this discovery, I was both excited and disappointed by the lack of more recent progress. For instance, the generalization to more than two independent variables seems nowhere to be found, even though Clark and subsequent authors all resorted to the cheap workaround of taking the maximum of two variables at a time and approximating the intermediate results with Gaussians. This approximation is actually pretty crude, and the error accumulates quickly as the number of bifurcations increases. Given that the moments of $X \vee Y$ already involve the cumulative normal distribution function, one can imagine that with three or more variables, iterated antiderivatives of the CDFs are bound to appear, rendering the result not so elegant. When restricted to two normal variables, one can even get somewhat elegant formulas, that is, ones involving at most the CDFs, even in the correlated case.
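Clark's formulas for the first two moments of the two-variable maximum are compact enough to state in code. Below is a minimal sketch for the independent case ($\rho = 0$); the function name is my own:

```python
import math
from statistics import NormalDist

def clark_max_moments(mu1, s1, mu2, s2):
    """Mean and variance of max(X, Y) for independent X ~ N(mu1, s1^2),
    Y ~ N(mu2, s2^2), following Clark's 1961 formulas with rho = 0."""
    Phi, phi = NormalDist().cdf, NormalDist().pdf
    a = math.sqrt(s1**2 + s2**2)        # std dev of X - Y (independent case)
    alpha = (mu1 - mu2) / a
    m1 = mu1 * Phi(alpha) + mu2 * Phi(-alpha) + a * phi(alpha)
    m2 = ((mu1**2 + s1**2) * Phi(alpha) + (mu2**2 + s2**2) * Phi(-alpha)
          + (mu1 + mu2) * a * phi(alpha))
    return m1, m2 - m1**2

# Sanity check against the iid standard normal case:
# E max = 1/sqrt(pi), var max = 1 - 1/pi.
m, v = clark_max_moments(0, 1, 0, 1)
```

Note that both the CDF $\Phi$ and the density $\phi$ already appear for two variables, in line with the remark about iterated antiderivatives showing up for three or more.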

Vexed by the unfriendliness of Gaussians in this business, I sought other better-behaved distributions, such as the exponential, the Gamma, the negative binomial, and the beta distributions. None of these, however, is closed under taking maxima, and that exhausts my familiar list of standard distributions. So I had to resort to more exotic ones, such as the extreme value distributions. These by design are closed under maxima, since their CDFs satisfy $F(x;\alpha)^k = F(x;\alpha')$ for some $\alpha' = \alpha'(\alpha, k)$. Recall that the CDF $F$ is defined by $F(x) = P(X \le x)$. This suggests that there are infinitely many such families of maximum-closed distributions on $\mathbb{R}$: take any seed CDF $F$ and define $F(x;\alpha) = F(x)^\alpha$, which is a valid CDF for any $\alpha > 0$. The extreme value distributions are all subsumed under the so-called generalized extreme value distribution, with CDF
$e^{-(1 + \xi x)^{-1/\xi}}$, where $x$ is allowed to have shifts and dilations. Different shifts and dilations correspond to iterated applications of the maximum.
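The power-family construction is easy to test numerically. The sketch below (my own illustration, using the standard logistic as the seed CDF because its inverse is explicit) samples from $F^\alpha$ and $F^\beta$ by inverse transform and checks that the maximum follows $F^{\alpha+\beta}$:

```python
import math, random

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_power(alpha, rng):
    # Inverse-transform sampling from the CDF F(x)^alpha, F = standard logistic:
    # F(x)^alpha = u  =>  F(x) = u**(1/alpha)  =>  x = log(v / (1 - v)).
    v = rng.random() ** (1.0 / alpha)
    v = min(max(v, 1e-12), 1 - 1e-12)   # guard the endpoints
    return math.log(v / (1.0 - v))

rng = random.Random(0)
alpha, beta, n = 1.5, 2.5, 50000
maxima = [max(sample_power(alpha, rng), sample_power(beta, rng))
          for _ in range(n)]
# P(max <= x) = F(x)^alpha * F(x)^beta = F(x)^(alpha + beta)
x0 = 1.0
empirical = sum(m <= x0 for m in maxima) / n
theoretical = logistic_cdf(x0) ** (alpha + beta)
```

So the family parameter simply adds under independent maxima, exactly the closure property described above.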

The only drawback of these heaven-made, maximum-friendly distributions is that they are not convolution-friendly! Here convolution is the operation on probability density/mass functions corresponding to the addition of two independent random variables. For instance, the sum of two independent normal variables is again normal, as is well known, and so is the sum of an independent exponential and Gamma variable with the same rate (again a Gamma). In general, if $f(\cdot;\alpha)$ is a finitely parametrized family of densities, being closed under convolution entails
$\hat{f}(\cdot;\alpha)\, \hat{f}(\cdot;\beta) = \hat{f}(\cdot;\gamma)$, where $\hat{f}$ denotes the Fourier transform of $f$, in other words (up to a sign convention in the exponent) the characteristic function $\mathbb{E}\, e^{-itX}$ for an $f$-distributed random variable $X$. It's easy to check that none of the extreme value distributions satisfies this property; otherwise they would be almost as famous as Gaussians, since the maximum operation occurs so universally in science and engineering.
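One quick way to see this for the Gumbel family: skewness is invariant under shifts and dilations, so every Gumbel distribution has the same skewness, while cumulants add under independent sums. A short check using standard Gumbel cumulant facts:

```python
import math

# Standard Gumbel cumulants: kappa2 = pi^2/6, kappa3 = 2*zeta(3).
zeta3 = sum(1.0 / k**3 for k in range(1, 200000))   # Apery's constant
kappa2 = math.pi**2 / 6
kappa3 = 2 * zeta3
skew_gumbel = kappa3 / kappa2**1.5   # ~1.14, shared by every Gumbel(mu, beta)

# For X + Y with X, Y iid Gumbel, second and third cumulants both double:
skew_sum = (2 * kappa3) / (2 * kappa2)**1.5   # = skew_gumbel / sqrt(2) ~ 0.81
```

Since no member of a location-scale family can have two different skewnesses, the sum of two iid Gumbels cannot itself be Gumbel.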

After many futile attempts, I was unable to find such a miracle distribution family, one closed under both independent maxima and independent addition. I don't have a non-existence proof either, nor is it clear to me what the right conditions should be. For instance, the following object might be a good target for a nonexistence proof:

a 1-parameter family of distributions $F_\alpha$ with smooth and everywhere nonzero densities $f_\alpha$ on $\mathbb{R}$, such that $f_\alpha \ast f_\beta = f_\gamma$ and $F_\alpha F_\beta = F_\delta$ for some $\gamma = \gamma(\alpha, \beta)$ and $\delta = \delta(\alpha, \beta)$.

This is a special case of the existence question posed here.

Some partial results did occur to me: for instance, I can show that $\mathbb{E}[X \vee X'] \leq 2\, \mathbb{E} X$ for iid nonnegative $X, X'$, which puts some constraint on $\delta$ as a function of $\alpha, \beta$. For instance, if $\alpha = \beta$, then $\delta < 2\alpha$. Thus the answer to the natural-number-parameter version of the question posed in the mathoverflow thread above seems to be negative.

Finally I will reveal a misconception I held for a long time, namely the false claim that $var(X \vee Y) \le (var\, X) \vee (var\, Y)$. This led to the even bolder claim that $var(\max_i X_i) \le \max_i var(X_i)$. Yuval Peres and others gave a nice argument that the best bound one could hope for is $var(\max_i X_i) \le \sum_i var(X_i)$. Let $X_1, \dots, X_n$ be iid indicators of events of probability $\epsilon$. Then $\max_i X_i = 1$ with probability $n\epsilon + O(n^2 \epsilon^2)$, because the pairwise overlaps of these events have probability of order $n^2 \epsilon^2$ and overlaps of three or more are of even lower order. Thus this maximum is effectively the indicator of an event of probability $p = n\epsilon$, whose variance is unfortunately $p(1-p) \approx p \approx \sum_i var(X_i)$, so the sum bound is essentially attained, while the "max" bound $\epsilon(1-\epsilon)$ is off by a factor of $n$. The bound $var(\max_i X_i) \le \sum_i var(X_i)$ itself is true because of the following reasoning:
let $Z_i := X_i \mathbf{1}_{\{X_i = \max_j X_j\}}$. Then $var(Z_i) < var(X_i)$; in fact $\mathbb{E}(Z_i) < \mathbb{E}\max_j X_j$. The rest is left as an easy exercise.
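The tightness of the sum bound in the indicator example can be checked in closed form (the particular $n$ and $\epsilon$ below are my own choices):

```python
n, eps = 100, 1e-4
p = 1 - (1 - eps)**n            # P(max of n iid Bernoulli(eps) equals 1)
var_max = p * (1 - p)           # variance of the maximum (itself an indicator)
sum_var = n * eps * (1 - eps)   # sum of the individual variances
max_var = eps * (1 - eps)       # the bound from the false "max" claim
```

Here `var_max` comes within about a percent of `sum_var`, while exceeding `max_var` by nearly a factor of $n$.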

Is it true that Gaussians satisfy the false claim above? For two independent Gaussians with equal variances, the answer turns out to be yes. The plot below shows the variance of the independent Gaussian maximum $\max\{N(0,1), N(\mu,1)\}$, with the same variance $1$ but different means; the horizontal axis is $\mu$. Most of our intuition for the smallness of the variance of the maximum comes from the narrow region $\mu \approx 0$, that is, the iid case, where the variance is a truly minuscule $1 - 1/\pi \approx 0.68$. From there it creeps up toward $1$ as $|\mu| \to \infty$, but it never exceeds the individual variance $1$: writing $\max\{X, Y\} = \frac{X+Y}{2} + \frac{|X-Y|}{2}$ and noting that $X+Y$ and $X-Y$ are independent in this equal-variance case,

$var(\max\{X,Y\}) = var\left(\tfrac{X+Y}{2}\right) + var\left(\tfrac{|X-Y|}{2}\right) \le \tfrac{1}{2}\, var(X) + \tfrac{1}{2}\, var(Y) = 1,$

since $var(|Z|) \le var(Z)$ for any random variable $Z$. In particular it stays well below the sum bound $var(N(0,1)) + var(N(\mu,1)) = 2$.
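A quick Monte Carlo rendering of this variance curve (a sketch; the sample size and the evaluation points are arbitrary):

```python
import math, random, statistics

rng = random.Random(1)

def var_max_gauss(mu, n=200000):
    # Monte Carlo variance of max{N(0,1), N(mu,1)}, independent draws
    draws = [max(rng.gauss(0, 1), rng.gauss(mu, 1)) for _ in range(n)]
    return statistics.pvariance(draws)

v0 = var_max_gauss(0.0)    # iid case, should be near 1 - 1/pi
v19 = var_max_gauss(1.9)   # well-separated means
```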

So why am I pursuing the maximum of Gaussians over and over? Recently at work I needed to estimate how much user click feedback is required to choose the optimal set of $k$ candidates from a pool of $N$. The click feedback I base the judgment on is the click-through rate (ctr), which is simply the ratio of the number of clicks to the number of views, and hence, when properly scaled, satisfies the Central Limit Theorem. I simply want to know when to stop observing a particular candidate, once enough statistical confidence has been gained, whence the opportunity should be given to other candidates for better exploration. A reasonable grand objective function to optimize is the confidence interval of the total ctr of the top $k$ candidates at the end of going over all $N$ candidates; in other words, I want to minimize the following quantity, where $c_{(i)}$ is the $i$th highest observed ctr:

$var(\sum_{i=1}^k c_{(i)})$.
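This objective is straightforward to estimate by simulation. Here is a sketch under a toy model in which each observed ctr is Gaussian around its true mean (the model, names, and numbers are my own illustration):

```python
import random, statistics

rng = random.Random(2)

def topk_sum_var(means, sigma, k, trials=20000):
    """Monte Carlo variance of the sum of the k largest observed values,
    where observation i ~ N(means[i], sigma^2), independently."""
    sums = []
    for _ in range(trials):
        obs = sorted((rng.gauss(m, sigma) for m in means), reverse=True)
        sums.append(sum(obs[:k]))
    return statistics.pvariance(sums)

# Sanity check: with k = N the order statistics sum to the plain sum,
# whose variance is exactly N * sigma^2 = 3 here.
v_all = topk_sum_var([0.0, 0.0, 0.0], sigma=1.0, k=3)
```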

There are various more sophisticated versions of this problem, in the context of exploration/exploitation (aka learning vs. earning in B-school parlance), or the multi-armed bandit, with a vast generalization of the latter to the so-called matroid bandit, which covers a whole range of transition rules. Our problem certainly falls within the matroid bandit framework once we set all intermediate rewards to $0$, since I do not care about immediate rewards, only the final reward after the experiments are all done. The way matroid bandits, or any other bandit problems, are typically solved is through something called UCB (upper confidence bound), meaning the algorithm switches horses when a certain confidence level is reached regarding the admissibility of the current candidate or candidate set. This approach can be proven to have asymptotically zero regret, a notion that characterizes optimality in this field. I am, however, concerned about the asymptotic nature of this reasoning and would like to approach the optimal solution from a purely global perspective, namely optimizing for the final confidence bound. That may be a really tall order in the general bandit setting, but in the zero-immediate-reward case it should have a chance.
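For concreteness, here is a minimal UCB1 sketch for Bernoulli click feedback (the arm probabilities and horizon are made-up illustrations, not a claim about how matroid bandits are implemented):

```python
import math, random

def ucb1(true_ctrs, horizon, seed=0):
    """Minimal UCB1: play every arm once, then pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_i)."""
    rng = random.Random(seed)
    k = len(true_ctrs)
    pulls, wins = [0] * k, [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                       # initialization: each arm once
        else:
            i = max(range(k), key=lambda j: wins[j] / pulls[j]
                    + math.sqrt(2 * math.log(t) / pulls[j]))
        pulls[i] += 1
        wins[i] += rng.random() < true_ctrs[i]   # Bernoulli click
    return pulls, [wins[j] / pulls[j] for j in range(k)]

pulls, est = ucb1([0.1, 0.8], horizon=3000)
```

The exploration bonus here is tuned for per-round regret; optimizing the final confidence bound alone, as proposed above, would call for a different allocation rule.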