A comment on Andrew Ng’s explanation of the role of regularization

I started watching AN’s coursera course on Machine learning, a field I want to get into hopefully in the near future. The clarity of Andrew’s presentation kept me watching one video after another. His example of quadratic versus quartic model for housing prices provided good motivation for penalization terms in regression models to reduce overfitting. But when introducing a general shrinkage term of the form $\lambda \sum_i \theta_i^2$, where $\theta_i$‘s are the polynomial regression coefficients, he seems to suggest it’s very complicated to understand how keeping the parameters small yield simpler models. Indeed penalizing all the coefficients is different from just a subset of pre-selected ones. Here is how I would explain this phenomenon in the case of polynomial regression: given a set of say 10 points $(x_i, y_i)$, that roughly look like a quadratic curve, one can of course fit a 9th degree polynomial to match all 10 points. But doing so would require very large coefficients because the points are really not growing as fast as a typical 9th degree polynomial, thus one requires a lot of cancellation in polynomial terms to create something that locally looks like a small perturbation of a quadratic. This easily increases the total absolute value size of all the coefficients, or $\sum_i \theta_i^2$. For those who have not seen this phenomenon in analytical estimates: consider the power series for $sin 100 x = \sum_{i=0}^\infty (-1)^i x^{2i+1} 100^{2i+1}/ (2i+1)!$. The individual coefficients are quite big, especially for small degree as I chose the intentionally vicious multiplier $100$, whose powers are not easily dominated by the factorial denominators. But we know well that $\sin 100 x \in [-1,1]$. The reason is simply cancellation of plus and minus terms. Now to go a bit further, if we are just interested in approximating $\sin 100 x$ for $x \in [0, 1/100]$ to reasonable precision, we don’t need an infinite series. In fact we don’t even need an $100$ degree polynomial. This problem is exactly scalable from that of approximating $\sin x$ for $x \in [0,1]$. I would probably choose a cubic, that is, $sin x \approx x - x^3/6$. Presumably Tichnoff regularization would let us do that.