A comment on Andrew Ng’s explanation of the role of regularization

I started watching Andrew Ng's Coursera course on machine learning, a field I hope to get into in the near future. The clarity of Andrew's presentation kept me watching one video after another. His example of a quadratic versus a quartic model for housing prices provided good motivation for adding penalization terms to regression models to reduce overfitting. But when introducing a general shrinkage term of the form \lambda \sum_i \theta_i^2, where the \theta_i's are the polynomial regression coefficients, he seems to suggest that it is quite complicated to understand why keeping the parameters small yields simpler models. Indeed, penalizing all of the coefficients is different from penalizing just a pre-selected subset.

Here is how I would explain this phenomenon in the case of polynomial regression. Given a set of, say, 10 points (x_i, y_i) that roughly follow a quadratic curve, one can of course fit a 9th-degree polynomial to match all 10 points exactly. But doing so requires very large coefficients: the points are simply not growing as fast as a typical 9th-degree polynomial does, so a lot of cancellation among the polynomial terms is needed to produce something that locally looks like a small perturbation of a quadratic. This cancellation drives up the total size of the coefficients, as measured by \sum_i \theta_i^2.

For those who have not seen this phenomenon in analytical estimates, consider the power series \sin 100x = \sum_{i=0}^\infty (-1)^i 100^{2i+1} x^{2i+1} / (2i+1)!. The individual coefficients are quite big, especially in low degrees, since I intentionally chose the vicious multiplier 100, whose powers are not easily dominated by the factorial denominators. Yet we know perfectly well that \sin 100x \in [-1,1]. The reason is simply cancellation between the plus and minus terms. To go a bit further: if we are only interested in approximating \sin 100x for x \in [0, 1/100] to reasonable precision, we don't need an infinite series. In fact, we don't even need a 100th-degree polynomial.
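To see the coefficient blow-up numerically, here is a quick sketch with NumPy (the quadratic-ish data and the noise level are made up for illustration): fitting an interpolating 9th-degree polynomial through 10 roughly quadratic points produces coefficients whose sum of squares dwarfs that of a plain quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 points that roughly follow a quadratic, plus a little noise
x = np.linspace(0, 1, 10)
y = 2 * x**2 - x + 0.5 + rng.normal(scale=0.05, size=10)

# Interpolating 9th-degree fit: passes through all 10 points,
# but only by means of huge, mutually cancelling coefficients
coeffs_9 = np.polyfit(x, y, deg=9)

# Plain quadratic fit for comparison
coeffs_2 = np.polyfit(x, y, deg=2)

print("sum theta_i^2, degree 9:", np.sum(coeffs_9**2))
print("sum theta_i^2, degree 2:", np.sum(coeffs_2**2))
```

The degree-9 penalty \sum_i \theta_i^2 comes out many orders of magnitude larger, which is exactly what the shrinkage term punishes.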
This problem rescales exactly to that of approximating \sin x for x \in [0,1] (substitute u = 100x). There I would probably choose a cubic, namely \sin x \approx x - x^3/6. Presumably Tikhonov regularization would let us discover such a low-degree approximation automatically.
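To close the loop, here is a minimal sketch of Tikhonov (ridge) regularization on the same kind of data, using the closed-form solution \theta = (X^T X + \lambda I)^{-1} X^T y; the value \lambda = 10^{-3} is an arbitrary choice for illustration. The regularized coefficients are far smaller than the unpenalized interpolating ones, even though both use a degree-9 model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x**2 - x + 0.5 + rng.normal(scale=0.05, size=10)

# Degree-9 design matrix with columns 1, x, x^2, ..., x^9
X = np.vander(x, 10, increasing=True)

lam = 1e-3  # hypothetical shrinkage strength, chosen for illustration

# Ridge (Tikhonov) solution: (X^T X + lam * I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# Ordinary least-squares solution (here: exact interpolation)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

print("sum theta_i^2, ridge:        ", np.sum(theta_ridge**2))
print("sum theta_i^2, least squares:", np.sum(theta_ls**2))
```

For any \lambda > 0 the ridge solution has a strictly smaller norm than the least-squares one, and in practice it ends up close to the underlying quadratic: exactly the "keep the parameters small, get a simpler model" effect.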


About aquazorcarson

math PhD at Stanford, studying probability
