## Buggy day

Being a software engineer means that it’s a constant struggle between depression and complacence. At the peak of either extremes, there is also the need for periodic introspection. Hands-on people, as are typical of engineers, tend to be negligent of more nurturing and routine things in life. These things are not as rewarding or exciting, but can make a difference in shaping us as human beings, how we perceive ourselves, how we want to chart the course of life, and our influence on people all around. One excuse we often give ourselves is that there is no time. Indeed, as a working class member, I have to take kids to school, do all the usual chores, as well as stay in office from 9 to 5, in addition to 1.5 hours on the road. But if we do not spend some quality time daily to nurture our soul, we run into the risk of leading a completely meaningless life, devoid of substance and purpose, and not only is this leading to a sad terminal state, but may also interfere traumatically as we drag our bodies towards that end state.

As we age, the amount of competitive pressure around us naturally rises. Also on the rise is the sense that our experience has enabled us to stay ahead of the game indefinitely. Even in my early 30s, I can feel that I am gravitating towards the same mental trap that millions ahead of me have experienced, the notion that a superpower, in the form of automated intelligence at my fingertip,  has dawned upon me that makes me invincible for life. Quite the contrary, this infatuation with and over-romanticizing of superpower is the death knell of the biological and sympathetic side of a person. The ease with which the said superpower is acquired should be caution enough against over-reliance on it. Unfortunately people often find the easy ways in life and follow their so-called passion without considering the context. The obvious thing to do is not always the right thing to do. The context is also very important. During the medieval age, a scribe is a very respectable job that has the status of a professor in today’s day and age, while nowadays a typist cannot even make a living, because virtually everyone is capable of that kind of skill. What truly distinguishes a person is some unique skill rarely seen in the mass. Unfortunately, the sheer multiplicative quantity of human beings has rendered such redeeming anomalies less and less likely over time, especially under the wave of globalization. So we often dial back to a second order competitive advantage, fostering a good habit by being persistent about it. This is a necessity, but alone does not lead to the age-old pursuit of life-long happiness. After all humans are conditioned to appreciate change (hopefully in the positive direction) in fortune, rather than an eternal possession thereof.

Habit that takes the form of robotic and thoughtless actions tends to degrade us as humans and contribute little to our characters. It merely enslaves our mind and numbs our sense of righteousness and ability to articulate. Habit that involves creative, presentational, and perhaps even mildly confrontational episodes tends to provide more utility as we chart through the difficult course of inter-personal relational quagmire at work, at home, and in the society at large. A scientifically trivial act of diary writing, for instance, juggles our mind in a spontaneous direction, with the collision of diction and emotion freeing us from the confine of algorithmic precision, a constraint imposed by our silicon nemesis to accelerate the process of intellectual polarization and squeeze the last breath out of an osmotic soul. Thus human life must be variegated, unanticipated, serendipitous, and original. To follow any predestined path is to defy the will of the Creator, which leads to misery by definition.

## On Norbert Blum’s latest proof that P != NP

I first read about this on hacker news, which I encourage more people to subscribe for quality content in the tech/science space. Many people have raised doubts (including Scott Aaronson whose Democritus blog I used to follow) about the validity of the proof, but few seem to be able to point out any flaw to date. I found it generally well-written in the beginning, except some of the preliminary results seem too basic, such as the conversion between CNF and DNF, such as Lemma 1 and Lemma 2. The writing style reminds me of my REU project during summer 2007, when I wrote my first original research papers: the balance between non-triviality and clarity is often a newbie’s struggle.

In any case, I was happy with most of the arguments until the proof of Theorem 2. I am not questioning the correctness of its statement since I haven’t had time to try to find a counterexample or come up with a proof independently. The assertion, “By construction, no variable can be removed from $c_\ell$ without destroying this property”, however, does seem wrong to me. Here is a simple counterexample:

Take $p_1 = x_1 x_2 x_5$, $p_2 = x_2 x_3$, $m = x_1 x_4$, and $c_\ell = x_2 + x_5$, where implicit product means and ($\wedge$) and $+$ sign means or ($\vee$). Furthermore let $f = p_1 + p2$ and $res_\beta(g_0) = f + m$. Then $c_\ell$ is clearly an $f$-clause but not a prime one, since $x_2$ alone is an $f$-clause.

Now I am not sure if this step is critical to the rest of the proof of Theorem 2, but I am suitably discouraged at this point to put in more salvaging time. This also highlights another complaint of mine, which is the unnecessarily complicated notation $res_\beta(g_0)$, which I feel can be replaced by something more standard and light-handed.

## A caffeinated adventure into optimization algorithms and numerical solver libraries in python

Motivated by some optimization problem in quantitative finance as well as simple curiosity, I started looking into some word-of-mouth ML related algorithms and various useful libraries to solve large scale constrained optimization problem. Perhaps my understanding of optimization has deepened over the years, and the newly bought green matcha tea bag has wrought wonder to my head, the documentation in the open source community with regard to the various numerical libraries seemed exceedingly clear. Here I will simply share a lightly annotated laundry list of all the useful tidbits picked up in 2 hours of distracted self-study:

1. Bayes point algorithm: I first heard about this through a colleague’s work, where the main selling point seems to be the ability to control the model directly through the example weight, rather than feature weight. The definite reference seems to be this 34 page paper, which is clear on a sober afternoon, but can be quite daunting if one simply wants to scan through it to pick up the main idea. I haven’t finished it, but the first 5 pages look quite sensible and promise good return on time spent. My current speculative understanding is that this is a mixture of ideas from support vector machines, which focus on frequentist style optimization problem, and Bayesian inference, which is more expensive but has the nice “anytime” property, meaning even a non-converged model is useful. One thing I found funny was that the authors talked about Cox’s theorem on objective probability; not sure if it is really necessary in a technical paper like this, but authors are certainly allowed to digress a little philosophically.
2. Bayesian linear regression: I learned this mainly through the namesake wikipedia article. Don’t look at the multivariate version, since it distracts you from the main point. The idea is to have a prior of some distribution for the weights (i.e., linear regressors), which can be conveniently chosen to be Gaussian (a conjugate prior). Then the posterior will have some Gaussian distribution whose mean and variance depend on the data as well as the prior. The formula presented towards the end indeed shows that if the prior is highly concentrated near its mean (high confidence), then the posterior distribution will lean towards the prior mean.
3. Gauss Newton method: I thought I knew Newton’s method since grade school. Turns out even a veteran like me can get abysmally confused about the distinction between solving an equation and optimizing an objective function. So I spent a few minutes wondering why Newton’s method is considered a second order method, though to be fair, the label of second order has nothing to do with the use of 2nd derivatives, but mainly with the quadratic convergence rate. To those equally uninitiated readers, to a (vector-valued) equation of the form $F(x) = y$ will typically have isolated solutions only when $x$ and $y$ are of the same dimension (assumed both to be in some Euclidean space). Otherwise you get a sub-variety as your solution, which numerical analysts and engineers typically don’t care for. Similarly optimization only makes sense when the objective function is real-valued, rather than vector-valued. By taking the gradient, one converts the later type of problem to the former. In any rate, I started looking up Gauss Newton, which isn’t exactly Newton Ralphson, after seeing this line in the implementation note of the trf library, which by the way, is extremely well written and makes me understand trust region reflective algorithm in one sitting. In it, the author mentions that one can approximate the Hessian matrix of a nonlinear objective function with $J^T J$. This looked vaguely similar to the typical linear regression exact solution, and in fact is related. As long as the objective function is a sum of squares of individual components, the GN algorithm works, but approximating Hessian with first order derivatives. This obviously can speed up things a lot.
4. fmin_slsqp function in scipy: I found out about this mainly through this blog post. Upon looking at the implementation on github, I got a bit dismayed since a heavy portion of the code is done using python while and for loops. But I keep telling myself not to prematurely optimize, so maybe this will be my first library of choice. The underlying Sequential Least SQuares Programming approach looks somewhat quadratic.
5. least_squares implementation in scipy: This one looks more promising in terms of performance, in particular it uses the trf library mentioned below, which promises to be appropriate for large scale problems.
6. trust region reflective algorithm in scipy: the implementation note for trf above is quite good. The essential idea seems to be to treat a constrained optimization problem locally as a quadratic programming problem and bake the inequality constraints into the objective. Then reshape the region of optimization by the inverse of the diagonal matrix consisting of the distance to the boundaries in each direction (presumably the region is always convex so that this makes sense).
7. Cauchy point: the solution of the gradient descent multiplier to maximize descent under the constraint that the independent variable doesn’t move beyond a certain radius. This seems related to Wolfe condition, but the latter guarantees convergence of gradient norm to  0, whereas Cauchy is just a way to cheaply optimize a single gradient descent step under constraint.

## How do I pronounce my name?

Today as usual I arrived late to company breakfast, due to morning busy work with kids. As soon as I finished ordering, the receptionist asked for my name. So with some nasal stuffiness, I said John. After several repetition, he finally wrote down “Jong”. I had no idea how he came up with this highly unusual spelling. Maybe it is somewhat common among Koreans? Maybe he himself is also Korean? I then corrected his spelling. Just like other receptionists before him, he eventually figured it out and re-uttered my name with a theatrically condescending tone. Outraged by my inability to get it right the first time, I retorted: “how should I pronounce my name?” He then smiled and conceded that he had bad hearing. But surely there is something wrong with the way I handled the nasality of the word “John”. It’s one of those American words that has no dictionary perfect pitch.

My voice tends to lack the resounding quality of a leader, or even a domain expert. I attribute this not to my physical inadequacy, but a general lack of confidence. When I utter a sentence, it usually has not been completely thought out. Even if it has, my mind can vacillate mid-air. Throughout my higher education I have over-emphasized depth and originality of ideas and neglected presentation. It takes considerable deliberation to present a piece of information in a socially convincing manner, no matter how trivial it is. Indeed, great speakers tend to over-sell mundane ideas, over and over, without boring or embarrassing the audience. I might have missed a critical lesson for not going through the brutality of academic job search, which requires an inordinate amount of salesmanship. So as a stage II corporate parasite, I must voluntarily allocate quality time to re-establish my character independence.

## 多努力分布 (multinoulli distribution)

Today a colleague of mine brought up an interesting mathematical statistics problem. What is the Jensen Shannon divergence between two multinomial distributions? In the discussion, he mentioned that he has reduced the problem to looking at binomial distributions. I misheard it as Bernoulli distribution, and started wondering what’s the name of the multinomial analogue of that. Surely it is not called multinomial distribution, since the latter deals with $n$ objects, rather than 1.

Finally I read about Rademacher distribution, which is nothing but a rescaled version of Bernoulli distribution. Outraged by the excessive naming in mathematics, I started looking up the former curiosity.

According to wikipedia, the correct generalizing nomenclature is categorical distribution, or multinoulli distribution. I have never heard of multinoulli before, but the etymology is self-explanatory and Bernoulli seems respectable enough to coin a new word based on his name. The most natural Chinese translation becomes “how much effort?”. In fact, if one restricts to multinommial where $n =1$, then it’s the same as multinoulli distribution.

Back to the colleague’s problem: recall Jensen-Shannon is defined by
$JSD(u, v) = \sum u_i \log 2u_i / (u_i + v_i) + v_i \log 2v_i / (u_i + v_i)$. For $u = multinomial(n, \alpha)$ and $\latex v = multinomial(m, \beta)$, JSD doesn’t make sense unless $n = m$, which we assume. $\alpha$ and $\beta$ are vectors that sum to $1$, and both of the same dimension $k$, otherwise again it doesn’t make sense.

There are $n!/(n-k)!$ different points in the common state space. We can simply calculate the probability under $u$ and $v$ of each point. Taking the case of $k=2$, we are then dealing with the special case of binomial distribution. The calculation eventually reduces to a sum of the form
$\sum_i \binom{n}{i} \alpha^i \log (1 + (\alpha / \beta)^i)$. Mathematica suggests that this is not reducible to closed form in terms of common special functions. I suspected that one could Taylor expand $log(1 + \epsilon) = \epsilon - \epsilon^2 / 2 + \epsilon^3 / 3 - \ldots$ and get item-wise closed form. It is true that each term in the resulting expansion is summable in closed form, but summing them together becomes just as difficult. In fact with each $n$, I end up with $n$ terms in this roundabout summation in mathematica. So I am finally convinced that this problem has no analytic solution.