I first read about this on hacker news, which I encourage more people to subscribe for quality content in the tech/science space. Many people have raised doubts (including Scott Aaronson whose Democritus blog I used to follow) about the validity of the proof, but few seem to be able to point out any flaw to date. I found it generally well-written in the beginning, except some of the preliminary results seem too basic, such as the conversion between CNF and DNF, such as Lemma 1 and Lemma 2. The writing style reminds me of my REU project during summer 2007, when I wrote my first original research papers: the balance between non-triviality and clarity is often a newbie’s struggle.
In any case, I was happy with most of the arguments until the proof of Theorem 2. I am not questioning the correctness of its statement since I haven’t had time to try to find a counterexample or come up with a proof independently. The assertion, “By construction, no variable can be removed from without destroying this property”, however, does seem wrong to me. Here is a simple counterexample:
Take , , , and , where implicit product means and () and sign means or (). Furthermore let and . Then is clearly an -clause but not a prime one, since alone is an -clause.
Now I am not sure if this step is critical to the rest of the proof of Theorem 2, but I am suitably discouraged at this point to put in more salvaging time. This also highlights another complaint of mine, which is the unnecessarily complicated notation , which I feel can be replaced by something more standard and light-handed.
Motivated by some optimization problem in quantitative finance as well as simple curiosity, I started looking into some word-of-mouth ML related algorithms and various useful libraries to solve large scale constrained optimization problem. Perhaps my understanding of optimization has deepened over the years, and the newly bought green matcha tea bag has wrought wonder to my head, the documentation in the open source community with regard to the various numerical libraries seemed exceedingly clear. Here I will simply share a lightly annotated laundry list of all the useful tidbits picked up in 2 hours of distracted self-study:
- Bayes point algorithm: I first heard about this through a colleague’s work, where the main selling point seems to be the ability to control the model directly through the example weight, rather than feature weight. The definite reference seems to be this 34 page paper, which is clear on a sober afternoon, but can be quite daunting if one simply wants to scan through it to pick up the main idea. I haven’t finished it, but the first 5 pages look quite sensible and promise good return on time spent. My current speculative understanding is that this is a mixture of ideas from support vector machines, which focus on frequentist style optimization problem, and Bayesian inference, which is more expensive but has the nice “anytime” property, meaning even a non-converged model is useful. One thing I found funny was that the authors talked about Cox’s theorem on objective probability; not sure if it is really necessary in a technical paper like this, but authors are certainly allowed to digress a little philosophically.
- Bayesian linear regression: I learned this mainly through the namesake wikipedia article. Don’t look at the multivariate version, since it distracts you from the main point. The idea is to have a prior of some distribution for the weights (i.e., linear regressors), which can be conveniently chosen to be Gaussian (a conjugate prior). Then the posterior will have some Gaussian distribution whose mean and variance depend on the data as well as the prior. The formula presented towards the end indeed shows that if the prior is highly concentrated near its mean (high confidence), then the posterior distribution will lean towards the prior mean.
- Gauss Newton method: I thought I knew Newton’s method since grade school. Turns out even a veteran like me can get abysmally confused about the distinction between solving an equation and optimizing an objective function. So I spent a few minutes wondering why Newton’s method is considered a second order method, though to be fair, the label of second order has nothing to do with the use of 2nd derivatives, but mainly with the quadratic convergence rate. To those equally uninitiated readers, to a (vector-valued) equation of the form will typically have isolated solutions only when and are of the same dimension (assumed both to be in some Euclidean space). Otherwise you get a sub-variety as your solution, which numerical analysts and engineers typically don’t care for. Similarly optimization only makes sense when the objective function is real-valued, rather than vector-valued. By taking the gradient, one converts the later type of problem to the former. In any rate, I started looking up Gauss Newton, which isn’t exactly Newton Ralphson, after seeing this line in the implementation note of the trf library, which by the way, is extremely well written and makes me understand trust region reflective algorithm in one sitting. In it, the author mentions that one can approximate the Hessian matrix of a nonlinear objective function with . This looked vaguely similar to the typical linear regression exact solution, and in fact is related. As long as the objective function is a sum of squares of individual components, the GN algorithm works, but approximating Hessian with first order derivatives. This obviously can speed up things a lot.
- fmin_slsqp function in scipy: I found out about this mainly through this blog post. Upon looking at the implementation on github, I got a bit dismayed since a heavy portion of the code is done using python while and for loops. But I keep telling myself not to prematurely optimize, so maybe this will be my first library of choice. The underlying Sequential Least SQuares Programming approach looks somewhat quadratic.
- least_squares implementation in scipy: This one looks more promising in terms of performance, in particular it uses the trf library mentioned below, which promises to be appropriate for large scale problems.
- trust region reflective algorithm in scipy: the implementation note for trf above is quite good. The essential idea seems to be to treat a constrained optimization problem locally as a quadratic programming problem and bake the inequality constraints into the objective. Then reshape the region of optimization by the inverse of the diagonal matrix consisting of the distance to the boundaries in each direction (presumably the region is always convex so that this makes sense).
- Cauchy point: the solution of the gradient descent multiplier to maximize descent under the constraint that the independent variable doesn’t move beyond a certain radius. This seems related to Wolfe condition, but the latter guarantees convergence of gradient norm to 0, whereas Cauchy is just a way to cheaply optimize a single gradient descent step under constraint.
Today as usual I arrived late to company breakfast, due to morning busy work with kids. As soon as I finished ordering, the receptionist asked for my name. So with some nasal stuffiness, I said John. After several repetition, he finally wrote down “Jong”. I had no idea how he came up with this highly unusual spelling. Maybe it is somewhat common among Koreans? Maybe he himself is also Korean? I then corrected his spelling. Just like other receptionists before him, he eventually figured it out and re-uttered my name with a theatrically condescending tone. Outraged by my inability to get it right the first time, I retorted: “how should I pronounce my name?” He then smiled and conceded that he had bad hearing. But surely there is something wrong with the way I handled the nasality of the word “John”. It’s one of those American words that has no dictionary perfect pitch.
My voice tends to lack the resounding quality of a leader, or even a domain expert. I attribute this not to my physical inadequacy, but a general lack of confidence. When I utter a sentence, it usually has not been completely thought out. Even if it has, my mind can vacillate mid-air. Throughout my higher education I have over-emphasized depth and originality of ideas and neglected presentation. It takes considerable deliberation to present a piece of information in a socially convincing manner, no matter how trivial it is. Indeed, great speakers tend to over-sell mundane ideas, over and over, without boring or embarrassing the audience. I might have missed a critical lesson for not going through the brutality of academic job search, which requires an inordinate amount of salesmanship. So as a stage II corporate parasite, I must voluntarily allocate quality time to re-establish my character independence.
Today a colleague of mine brought up an interesting mathematical statistics problem. What is the Jensen Shannon divergence between two multinomial distributions? In the discussion, he mentioned that he has reduced the problem to looking at binomial distributions. I misheard it as Bernoulli distribution, and started wondering what’s the name of the multinomial analogue of that. Surely it is not called multinomial distribution, since the latter deals with objects, rather than 1.
Finally I read about Rademacher distribution, which is nothing but a rescaled version of Bernoulli distribution. Outraged by the excessive naming in mathematics, I started looking up the former curiosity.
According to wikipedia, the correct generalizing nomenclature is categorical distribution, or multinoulli distribution. I have never heard of multinoulli before, but the etymology is self-explanatory and Bernoulli seems respectable enough to coin a new word based on his name. The most natural Chinese translation becomes “how much effort?”. In fact, if one restricts to multinommial where , then it’s the same as multinoulli distribution.
Back to the colleague’s problem: recall Jensen-Shannon is defined by
. For and $\latex v = multinomial(m, \beta)$, JSD doesn’t make sense unless , which we assume. and are vectors that sum to , and both of the same dimension , otherwise again it doesn’t make sense.
There are different points in the common state space. We can simply calculate the probability under and of each point. Taking the case of $k=2$, we are then dealing with the special case of binomial distribution. The calculation eventually reduces to a sum of the form
. Mathematica suggests that this is not reducible to closed form in terms of common special functions. I suspected that one could Taylor expand and get item-wise closed form. It is true that each term in the resulting expansion is summable in closed form, but summing them together becomes just as difficult. In fact with each , I end up with terms in this roundabout summation in mathematica. So I am finally convinced that this problem has no analytic solution.
Only after my second child was born did I realize that I have been using the less economic insurance plans all these years. Even during years without major medical events, my family members, myself included, make hospital visits pretty frequently. The ideal plan in that situation is EPO. But I have been using a highly subsidized version of PPO sponsored through my company for the past year, since it was the best advertised one and seems to essentially level the deductible with EPO. What I did not understand is that for things like child delivery, EPO charges a flat $250 rate as reported by the Chinese community, whereas the PPO plan accumulates a bill in excess of $20k, which still charges $2k to me after coinsurance. Unfortunately the details are in the fine-prints, and it’s not in the insurance company’s best interest to make them transparent. Lesson learned. I will have to trust Chinese source of information far more than the English ones, because the latter just suck with irrelevant details.