Noise variance estimator for Linear regressions

A presentation should focus on a single story deeply, as opposed to be a panorama of all the work one has done so far. Today I want to demystify a fact recorded in Prof Lai and Haipeng’s book statistical methods and modelings in financial markets. Recall a linear regression model takes the form

${Y = X \beta + \epsilon}$, (1)

where ${X = (X_1, \ldots, X_p)}$ is the predictor values (p denotes the number of predictors and each ${X_i}$ is an ${n}$-dimensional vector standing for the value of ${i}$th predictor for each of ${n}$ samples. ${\epsilon}$ is assumed to be independent random noises with mean 0 variance 1. No Gaussian assumption is needed in the following discussion, nor identical distribution, but the first two moments need to be exactly the same, and the different ${\epsilon_i}$‘s need to be at least pairwise uncorrelated. Finally ${\beta}$ is the ${p}$-dimensional vector regression coefficients, thought of as a weight vector for the ${p}$ different predictors (traits).

The equation (1) is taken to be the truth. Given this truth, we would like to estimate ${\beta}$ based on observed points ${q_1, \ldots, q_n}$, each ${q_i}$ with associated predictor values ${X_{1,i}, X_{2,i}, \ldots, X_{p,i}}$ and response value ${Y_i}$. If the goal is to minimize the mean square error, namely ${\|Y - X \beta\|^2 = \sum_{i=1}^n ( y_i - \sum_{j=1}^p x_{i,j}\beta_j)^2}$, then the optimal ${\hat{\beta}}$ is such that ${X \hat{\beta}}$ is the projection of ${Y}$ onto the column space spanned by ${X_1, \ldots, X_p}$, which makes sense since the resulting estimated ${\hat{Y}}$ must be in their span, and the way to minimize distance (equivalently mean square error) with ${Y}$ itself is to take projection by Euclidean geometry.

What bothered me for a bit was the claim that the estimator of the variance of ${\epsilon}$ is given by the following formula:

${\frac{1}{n-p} \| Y - X \hat{\beta} \|^2}$. (2)

This seems funny because if we look at equation (1), it seems that the ${\epsilon}$ vector is always ${n}$-dimensional, and if we take ${\hat{\beta}= \beta}$, then the estimator should be ${\frac{1}{n} \|Y - X \hat{\beta}\|^2}$, regardless of the number of predictors ${p}$.

The reason the above naive reasoning fails is precisely that we can’t take ${\hat{\beta} = \beta}$ in this calculation.

Recall the formula for ${\hat{\beta}}$ is given by

${(X^T X)^{-1} X^T Y }$. (3)

Alternatively, one can define the projection operator onto column space of ${X}$: ${H := X (X^T X)^{-1} X^T}$. This formula is very easy to derive if one writes things in coordinate form, i.e., with all the subscripts and all. However it seems hard to assign direct intuition to it without computation. This has to do with noncommutative nature of matrix multiplication, otherwise you would expect ${H}$ to be reduced to the identity matrix (which is the case when ${X}$ is orthogonal).

Now plugging ${\hat{\beta}}$ (equation 3) as well as the original linear regression model relation of ${Y}$ in terms of ${X}$ (equation 1) into the Mean square error formula, we get ${\|Y - X \hat{\beta}\|^2 = \|(X \beta + \epsilon - X (X^T X)^{-1} X^T (X \beta + \epsilon)\|^2 = \|\epsilon - X(X^T X)^{-1} X^T \epsilon \|^2 }$.

We see that the end result is the magnitude squared of the noise term projected onto the column space of ${X}$; notice we used ${(X^T X)^{-1} X^T X = I}$. This plugging algebraic step is the hardest in deriving (2). The rest is simple probability computation:

Since ${I - X(X^T X)^{-1} X^T}$ is also a projection operator, this case onto the ${n-p}$ dimensional orthogonal complement ${N}$ of the column space of ${X}$, aka the null space of the ${H}$, we can write it as ${H' = U \Lambda U^{-1}}$, where ${U}$ is an orthogonal matrix and ${\Lambda}$ is a diagonal matrix with a few ones at first followed by all 0’s. This amounts to saying I pick an orthonormal basis in ${N}$ and extend to an orthonormal basis ${\mathcal{B}}$ in ${{\mathbb R}^n}$, and let ${U}$ be some rotation matrix that takes the standard basis vectors in ${{\mathbb R}^n}$ to ${\mathcal{B}}$, then projection in this new coordinate simply looks like picking out the first ${n-p}$ components and zeroing out the rest. Now consider the following expectation:

${\mathop{\mathbb E} \| (H')\epsilon\|^2 = \mathop{\mathbb E} \sum_i (\sum_j H'_{i,j} \epsilon_j )^2 = \sum_i \sum_j (H')_{i,j}^2 \mathop{\mathbb E} \epsilon_j^2}$, using pairwise uncorrelatedness of ${\epsilon_i}$‘s. Now ${\mathop{\mathbb E} \epsilon_j^2 \equiv 1}$ by assumption, and the length squared of each column ${\|H'_i\|^2 := \sum_j (H')_{i,j}^2}$ is invariant under multiplication by an orthogonal matrix. Hence ${\sum_{i,j} (H')_{i,j}^2 = \sum_{i,j} \Lambda_{i,j}^2 = n-p}$, i.e., the dimension of the space ${N}$.

Thus ${\|Y - X \hat{\beta}\|^2 = (n-p) \mathop{\mathbb E} \epsilon_i^2}$, which is why we need to divide by ${n-p}$ in the end instead of ${n}$.

In the case ${n =p}$, we get 0 denominator, which is not inconsistent because as long as ${X}$ has full rank (hence invertible), a perfect fit ${\beta}$ should be found, leaving no mean square error. Of course ${\frac{0}{0}}$ is undefined, so in this case we do not get a useful estimator of noise variance. Indeed as ${p}$ gets close to ${n}$, the variance of this variance estimator gets bigger. Here for simplicity let’s assume ${\mathop{\mathbb E} \epsilon_j^4}$ are identical and ${\epsilon_i}$‘s are quadruply uncorrelated, meaning ${\mathop{\mathbb E} \epsilon_{i_1}^{\alpha_1} \ldots \epsilon_{i_4}^{\alpha_4} = \mathop{\mathbb E}\epsilon_{i_1}^{\alpha_1} \ldots \mathop{\mathbb E} \epsilon_{i_4}^{\alpha_4}}$ provided ${i_1,\ldots, i_4}$ are pairwise different and ${\sum \alpha_i \le 4}$. This is strictly stronger than pairwise uncorrelated-ness. Then it’s easy to verify that ${\text{var} (n-p)^{-1}\|H' \epsilon \|^2 = \sum_{i_1, i_2, j} (H')_{i_1,j}^2 (H')_{i_2,j}^2 \mathop{\mathbb E} \epsilon_j^4 / (n-p)^2}$,

by orthogonality of rows and columns of ${H'}$, which in particular eliminates the cross terms ${\mathop{\mathbb E} H'_{i_{1,1}j_1} H'_{i_{1,2}j_2} H'_{i_{2,1}j_1} H'_{i_{2,2}j_2} \epsilon_{j_1} \epsilon_{j_2}}$.

Thus the sum scales linearly in ${n-p}$, hence the whole thing decreases as ${O((n-p)^{-1})}$. The variance is ultimately controlled by the fourth moment of ${\epsilon_i}$.