A presentation should focus on a single story in depth, as opposed to being a panorama of all the work one has done so far. Today I want to demystify a fact recorded in Prof. Lai and Haipeng Xing's book Statistical Models and Methods for Financial Markets. Recall that a linear regression model takes the form

$$Y = X\beta + \epsilon = \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon, \tag{1}$$

where $X = (x_1, \ldots, x_p)$ holds the predictor values ($p$ denotes the number of predictors, and each $x_j$ is an $n$-dimensional vector standing for the value of the $j$th predictor for each of the $n$ samples). The noise $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is usually assumed to consist of independent random noises with mean 0 and variance $\sigma^2$, normalized here to $\sigma^2 = 1$; the general case follows by scaling. No Gaussian assumption is needed in the following discussion, nor identical distribution, but the first two moments need to be exactly the same, and the different $\epsilon_i$'s need to be at least pairwise uncorrelated. Finally, $\beta$ is the $p$-dimensional vector of regression coefficients, thought of as a weight vector for the different predictors (traits).
Equation (1) is taken to be the truth. Given this truth, we would like to estimate $\beta$ based on $n$ observed points, each with associated predictor values $(x_{i1}, \ldots, x_{ip})$ and response value $y_i$. If the goal is to minimize the mean square error, namely $\frac{1}{n}|Y - X\hat\beta|^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2$ with $\hat Y = X\hat\beta$, then the optimal $\hat\beta$ is such that $\hat Y = X\hat\beta$ is the projection of $Y$ onto the column space spanned by $x_1, \ldots, x_p$. This makes sense: the resulting estimate $\hat Y$ must lie in their span, and by Euclidean geometry the way to minimize the distance (equivalently, the mean square error) to $Y$ itself is to take the projection.
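The projection picture can be checked numerically. The following is a minimal sketch, assuming NumPy and made-up sizes and coefficients (none of the numbers come from the book): it fits by least squares, then confirms that the residual is orthogonal to every column of the predictor matrix and that perturbing the fitted coefficients only increases the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                        # illustrative sizes (assumption)
X = rng.normal(size=(n, p))         # columns play the role of the predictors
beta = np.array([1.0, -2.0, 0.5])   # hypothetical "true" coefficients
Y = X @ beta + rng.normal(size=n)   # Gaussian noise used only for convenience

# Least-squares fit: beta_hat minimizes |Y - X b|^2 over b.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residual = Y - X @ beta_hat

# Projection property: the residual is orthogonal to every column of X.
max_inner = np.max(np.abs(X.T @ residual))

# Any other coefficient vector gives a fit that is farther from Y.
other = beta_hat + rng.normal(size=p)
gap = np.sum((Y - X @ other) ** 2) - np.sum(residual ** 2)

print(max_inner, gap)
```

The first printed number should be zero up to floating-point error, and the second should be positive.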
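What bothered me for a bit was the claim that the estimator of the variance of $\epsilon_i$ is given by the following formula:

$$\hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{1}{n-p}|Y - X\hat\beta|^2. \tag{2}$$

This seems funny because if we look at equation (1), the vector $\epsilon = Y - X\beta$ is always $n$-dimensional, and if we take $\hat\beta = \beta$, then the estimator should be $\frac{1}{n}|Y - X\beta|^2 = \frac{1}{n}\sum_{i=1}^n \epsilon_i^2$, regardless of the number of predictors $p$.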
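The reason the above naive reasoning fails is precisely that we can't take $\hat\beta = \beta$ in this calculation: $\beta$ is unknown, and $\hat\beta$ is fitted to the same noisy $Y$, so it depends on $\epsilon$.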
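One way to make this precise, as a sketch that assumes $\hat\beta$ is the least-squares fit (so that $Y - X\hat\beta$ is orthogonal to the column space of $X$), is the Pythagorean decomposition of the error at the true $\beta$:

$$|Y - X\beta|^2 = |Y - X\hat\beta|^2 + |X(\hat\beta - \beta)|^2 \ge |Y - X\hat\beta|^2.$$

The left-hand side is $|\epsilon|^2$, so the fitted residual is never larger than the true noise, and substituting $\hat\beta$ for $\beta$ systematically underestimates it.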
Recall the formula for $\hat\beta$ is given by

$$\hat\beta = (X^T X)^{-1} X^T Y. \tag{3}$$

Alternatively, one can define the projection operator onto the column space of $X$: $H = X(X^T X)^{-1}X^T$, so that $\hat Y = X\hat\beta = HY$. This formula is very easy to derive if one writes things in coordinate form, i.e., with all the subscripts. However, it seems hard to assign direct intuition to it without computation. This has to do with the noncommutative nature of matrix multiplication: otherwise you would expect $H$ to reduce to the identity matrix, which is indeed the case when $X$ is square and invertible, for instance orthogonal.
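The defining properties of the projection $H = X(X^T X)^{-1}X^T$ are easy to confirm numerically. This is a small sketch with NumPy; the helper name `hat_matrix` and the sizes are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2                          # illustrative sizes (assumption)
X = rng.normal(size=(n, p))

def hat_matrix(X):
    """H = X (X^T X)^{-1} X^T, using solve instead of an explicit inverse."""
    return X @ np.linalg.solve(X.T @ X, X.T)

H = hat_matrix(X)
symmetric = np.allclose(H, H.T)       # H^T = H
idempotent = np.allclose(H @ H, H)    # projecting twice changes nothing
fixes_X = np.allclose(H @ X, X)       # columns of X are left untouched
trace_H = np.trace(H)                 # equals p, the dimension of the column space

# With a square invertible X, the column space is everything and H is the identity.
X_square = rng.normal(size=(n, n))
is_identity = np.allclose(hat_matrix(X_square), np.eye(n), atol=1e-6)

print(symmetric, idempotent, fixes_X, trace_H, is_identity)
```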
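Now plugging $\hat\beta$ (equation 3), as well as the original linear regression relation of $Y$ in terms of $X$ (equation 1), into the squared error, we get

$$|Y - X\hat\beta|^2 = |(I - H)Y|^2 = |(I - H)(X\beta + \epsilon)|^2 = |(I - H)\epsilon|^2,$$

where $H = X(X^T X)^{-1}X^T$ is the projection onto the column space of $X$.

We see that the end result is the magnitude squared of the noise term projected onto the orthogonal complement of the column space of $X$; notice we used $HX = X$, i.e., $(I - H)X = 0$. This algebraic substitution is the hardest step in deriving (2). The rest is a simple probability computation: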
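The substitution step can be sanity-checked on simulated data, where the noise is known. The sketch below assumes NumPy, arbitrary sizes, and uniform noise (chosen only to stress that nothing Gaussian is used); it compares the fitted residual $Y - X\hat\beta$ with $(I - H)\epsilon$, where $H = X(X^T X)^{-1}X^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4                              # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)                 # hypothetical true coefficients
eps = rng.uniform(-1.0, 1.0, size=n)      # non-Gaussian noise, mean 0
Y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)     # projection onto the column space of X
M = np.eye(n) - H                         # projection onto its orthogonal complement

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # equation (3)
residual = Y - X @ beta_hat

same_vector = np.allclose(residual, M @ eps)   # residual = (I - H) eps
kills_signal = np.allclose(M @ X, 0.0)         # (I - H) X = 0
rss = residual @ residual                      # |(I - H) eps|^2
noise_sq = eps @ eps                           # |eps|^2

print(same_vector, kills_signal, rss <= noise_sq)
```

The true $\beta$ drops out entirely: the residual only sees the part of the noise that lies outside the column space of $X$.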
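Since $I - H$ is also a projection operator, in this case onto the $(n-p)$-dimensional orthogonal complement of the column space of $X$, aka the null space of $X^T$, we can write it as $I - H = UDU^T$, where $U$ is an orthogonal matrix and $D$ is a diagonal matrix with $n-p$ ones first, followed by all 0's. This amounts to saying: I pick an orthonormal basis $u_1, \ldots, u_{n-p}$ of that complement and extend it to an orthonormal basis $u_1, \ldots, u_n$ of $\mathbb{R}^n$, and let $U$ be the rotation (orthogonal) matrix that takes the standard basis vectors $e_i$ of $\mathbb{R}^n$ to $u_i$; then the projection in the new coordinates simply looks like picking out the first $n-p$ components and zeroing out the rest. Now consider the following expectation:

$$E|(I - H)\epsilon|^2 = E|UDU^T\epsilon|^2 = E|DU^T\epsilon|^2 = \sum_{i=1}^{n-p} E(u_i^T\epsilon)^2 = \sum_{i=1}^{n-p}\sum_{j=1}^{n} U_{ji}^2\, E\epsilon_j^2,$$

using pairwise uncorrelatedness of the $\epsilon_j$'s. Now by assumption $E\epsilon_j^2 = 1$, each column $u_i$ of $U$ has length squared $\sum_j U_{ji}^2 = 1$, and length is invariant under multiplication by an orthogonal matrix (which is what lets us drop the leading $U$). Hence $E|(I - H)\epsilon|^2 = n-p$, i.e., the dimension of the space onto which $I - H$ projects.

Thus $E\left[\frac{1}{n-p}|Y - X\hat\beta|^2\right] = 1$, the true noise variance, which is why we need to divide by $n-p$ in the end instead of $n$.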
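A quick Monte Carlo run illustrates the $n-p$ normalization under exactly the weak assumptions used here. This is a sketch, not a proof: the sizes, the trial count, and the noise recipe (random signs in even coordinates, uniform on $[-\sqrt{3}, \sqrt{3}]$ in odd ones, so the coordinates are not identically distributed but all have mean 0 and variance 1) are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, trials = 12, 5, 1.0, 20000   # illustrative values (assumption)
X = rng.normal(size=(n, p))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - H

# Two different noise laws with the same first two moments (mean 0, variance 1).
signs = rng.choice([-1.0, 1.0], size=(trials, n))
unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n))
use_signs = np.arange(n) % 2 == 0
eps = sigma * np.where(use_signs, signs, unif)      # one noise vector per row

# Residual sum of squares |(I - H) eps|^2 for every trial (M is symmetric).
rss = np.sum((eps @ M) ** 2, axis=1)

est_n_minus_p = rss.mean() / (n - p)   # close to sigma^2
est_n = rss.mean() / n                 # close to sigma^2 * (n - p) / n, biased low

print(est_n_minus_p, est_n, np.trace(M))
```

Dividing by $n$ would underestimate the variance by the factor $(n-p)/n$; the trace of $I - H$ printed at the end is the exact count $n-p$ behind that factor.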
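In the case $p = n$, we get a 0 denominator, which is not inconsistent: as long as $X$ has full rank (hence is invertible), a perfect fit is found, leaving no mean square error. Of course $0/0$ is undefined, so in this case we do not get a useful estimator of the noise variance. Indeed, as $p$ gets close to $n$, the variance of this variance estimator gets bigger. Here for simplicity let's assume the fourth moments $E\epsilon_i^4 = \mu_4$ are identical and the $\epsilon_i$'s are quadruply uncorrelated, meaning $E[\epsilon_i\epsilon_j\epsilon_k\epsilon_l] = 0$ whenever some index appears exactly once (in particular when $i, j, k, l$ are pairwise different), and $E[\epsilon_i^2\epsilon_j^2] = E\epsilon_i^2\, E\epsilon_j^2 = 1$ for $i \neq j$. This is strictly stronger than pairwise uncorrelatedness. Then it's easy to verify that

$$\mathrm{Var}\left(|(I - H)\epsilon|^2\right) = (\mu_4 - 3)\sum_{i=1}^{n} (I - H)_{ii}^2 + 2(n-p),$$

by orthogonality of the rows and columns of $U$, which gives $\mathrm{tr}(I - H) = \mathrm{tr}\left((I - H)^2\right) = n-p$; the moment assumptions eliminate the cross terms such as $E[\epsilon_i^3\epsilon_j]$ and $E[\epsilon_i^2\epsilon_j\epsilon_k]$.

Since $0 \le (I - H)_{ii} \le 1$, we have $\sum_i (I - H)_{ii}^2 \le n-p$. Thus the sum scales at most linearly in $n-p$, hence the variance of the estimator (2), which carries an extra factor of $1/(n-p)^2$, decreases as $1/(n-p)$. The variance is ultimately controlled by the fourth moment of $\epsilon_i$.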
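The variance claim can be checked exactly in a toy case. The sketch below relies on the standard identity for the variance of a quadratic form under these moment assumptions, $\mathrm{Var}(\epsilon^T M \epsilon) = (\mu_4 - 3\sigma^4)\sum_i M_{ii}^2 + 2\sigma^4\,\mathrm{tr}(M^2)$ for a symmetric $M$, applied to $M = I - H$. It assumes independent random-sign noise ($\sigma^2 = 1$, $\mu_4 = 1$), which is my choice so that all $2^n$ noise vectors can be enumerated instead of sampled.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3                                # small enough to enumerate 2^n sign vectors
X = rng.normal(size=(n, p))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - H, rank n - p

# Every equally likely noise vector with entries +1 or -1.
all_eps = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
Q = np.einsum("ti,ij,tj->t", all_eps, M, all_eps)   # eps^T M eps = |(I - H) eps|^2

exact_mean = Q.mean()   # exact expectation over the 2^n outcomes
exact_var = Q.var()     # exact (population) variance over the 2^n outcomes

sigma2, mu4 = 1.0, 1.0  # second and fourth moments of a random sign
formula_var = (mu4 - 3 * sigma2 ** 2) * np.sum(np.diag(M) ** 2) + 2 * sigma2 ** 2 * (n - p)
linear_bound = max(mu4 - sigma2 ** 2, 2 * sigma2 ** 2) * (n - p)

print(exact_mean, exact_var, formula_var, linear_bound)
```

The exact mean equals $n-p$, the exact variance matches the formula, and both sit under a bound that is linear in $n-p$, so dividing by $(n-p)^2$ leaves a variance of order $1/(n-p)$ for the estimator.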