Chinese Medicine

Over the past few years, I have increasingly noticed how the left side of my body seems dysfunctional at times. For instance, I started growing grey hair exclusively on one side of my head. My left leg is constantly enervated and sensitive at acupuncture points. Between my two kidneys I often feel a sense of asymmetry. While for the most part I can’t feel the existence of my right kidney, my left kidney often gets a tingling or even a shredding feeling, for lack of better terms. I had a kidney stone during early graduate school, about 7 years ago. It was a tiny piece that came out naturally through urination in the end, but it definitely wreaked havoc when I woke up to enormous pain around the abdomen. The doctors and nurses at the emergency room made a quick and accurate diagnosis, but left me without water for half a day just to be completely sure that it was indeed a kidney stone, while I lay on the gurney in morphine-muffled pain. Had I been given water earlier, the acute pain would have been washed away through urination, and I would have been spared the morphine. But that incident was not the earliest manifestation of my one-sided malady; I had kidney weakness even during elementary school. The influence of my father and the surrounding herbal medicine culture in China certainly made me more cognizant of the role of the kidneys in my overall health. The notion of selling one’s own kidney for a living, which comes up in films, always made me cringe. An English-Chinese bilingual anthology of marvelous anecdotes meant as ESL reading also mentioned that the adrenal gland shrinks irreversibly as one ages. But it was not until more recently that I started to take kidney health more seriously.

Today my wife suggested that I give moxibustion a try. This is one of the few oriental treatments she subscribes to, mainly in the context of gynecology. So for 15 dollars we bought a moxibustion box burner together with the moxa incense. With her help, I lit the moxa inside the burner and fastened the whole thing next to my ShenShu acupuncture point, which is at the same height as the belly button, but on the back, 1.5 Chinese inches away from the spine. For one brief moment my left kidney seemed to get a jump start of fresh blood. But after that there was no apparent physiological response, possibly because the cloth pocket kept too much of the heat away from my skin. Overall the procedure seems pretty harmless, and the claimed effect of increasing blood flow to the organs actually makes scientific sense. Whether or not the moxa itself is doing anything is unclear, but the heat certainly helps. I plan to stick to the routine 2-3 times a week and assess the benefit.

Western medical literature claims that almost all positive effects of moxibustion documented in past studies are due to publication bias. While there is definitely truth to that, one often overlooks the fact that Western medicine has pretty simple-minded metrics to gauge success, such as something as mechanical as a p-value. It is nearly impossible to experiment on long-term effects, just as in my own work we keep chasing short-term measurable gains but rarely look at long-term benefits to the users. Those latter objectives are usually reserved for top executives, so there is much less science involved.

Posted in Uncategorized | Leave a comment

Sanity check Vowpal Wabbit

One reason that I have been working with linear regressions for years and still haven’t been able to move on to more glorious machine learning models like neural networks is that, even though the underlying idea is simple, it’s virtually impossible to sanity check the correctness of a linear regression library with the naked eye.

Back at Yahoo Labs, I wrote down some steps on how to verify that VW and the Weka wrapper of the liblinear library are in fact producing the same results. Being a good corporate citizen, however, I did not bring that knowledge with me when I left. I now care much less about liblinear, but VW is still a great tool to carry around. Its sheer speed of training seems unmatched so far on a single machine. So here I will focus on how to sanity check results from VW (v7.3), which should help the user gain a better understanding of its myriad flags as well.

  1. BFGS is the training mode of choice
  2. Issue the following command to train (sim is my own suffix, denoting simulated data):

    vw --bfgs --cache --cache_file=cache.sim -d out-00000-of-00001 --readable_model=readable.sim --passes=10 --termination=0.0000000001 --loss_function=squared --bit_precision=22 --final_regressor=model.sim

  3. Test on the original training data set. There are two kinds of prediction output flags, --predictions and --raw_predictions. The former seems to always truncate a final prediction above 1.0 to exactly 1.0.

    vw -d out-00000-of-00001 -i model.sim --raw_predictions=pred.sim --testonly

  4. Create a dummy data file consisting of one feature per row, including the empty string feature denoting the constant term:

    cat out-00000-of-00001 | python invert_feats.py > feats.sim

  5. Generate const + feature weight for each feature using feats.sim:

    vw -d feats.sim -i model.sim --raw_predictions=invert.sim --testonly

  6. Subtract const from feature weights:

    python subtract_const.py invert.sim invert.vw

  7. Paste pred.sim and the raw data side by side, feed the result into a model applier Python script, and eyeball the agreement between the vw (raw) predictions and the Python ones.

    paste pred.sim out-00000-of-00001 | python apply_wts.py invert.sim | less

  8. I (re)discovered the raw_predictions flag through a useful feature called audit.

    vw -d feats.sim -i model.sim --testonly --audit | less

    It allows you to look at the data value and model value of each feature used in each input example:

    0.068238 20;riversdale%20rd
    Constant:3261788:1:0.130461 w^riversdale%20rd:78240:1:-0.0622233
    0.146634 36;opera%20mini
    Constant:3261788:1:0.130461 w^opera%20mini:2082032:1:0.0161729
    281.882654 562.090180 6 6.0 36.0000 0.1466 2
    0.154288 14;uoskirt
    Constant:3261788:1:0.130461 w^uoskirt:3187096:1:0.0238272

As you can see, the constant term gets an example value of 1 in every example, and its model value is 0.130461. The “w^” token is the namespace of the features that I put in my training data. The big number after the name of the feature is the hash value. There are several sources of confusion with the above steps:

  1. The specification of model file is via -f in training, and -i in test. -f stands for final_regressor and -i stands for initial_regressor.
  2. --readable_model is not that useful, unless one can reproduce the hashing function used by VW. Let me know if you can implement it in Python!!
  3. One shortcoming of bfgs mode is that --raw_predictions can be quite different from --predictions; the former can show a huge prediction loss since it doesn’t truncate predictions to [0,1]. For a sanity check against another LR library, sgd is the better mode to use. For an underdetermined system, don’t expect the feature weights to be even close between two LR libraries. But one should expect the prediction scores to be quite close when both have converged reasonably, since the least-squares prediction vector is the closest point to the label vector in the space spanned by the feature columns, and that point is unique even when the weights are not.
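To make the last point concrete, here is a minimal sketch in plain Python. It is a toy data set of my own invention, not VW output: the second feature duplicates the first, so the system is underdetermined. The names `X`, `y`, `candidates`, `predict`, and `squared_loss` are all hypothetical. The three weight vectors look nothing alike, yet they give identical predictions and identical loss.

```python
# Toy illustration (made-up data, not VW output): the second feature column
# duplicates the first, so infinitely many weight vectors fit equally well.
X = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
y = [2.0, 4.0, 6.0]

def predict(weights, rows):
    """Linear prediction without a constant term: sum of weight * value."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

def squared_loss(weights, rows, labels):
    """Total squared error of the predictions against the labels."""
    return sum((p - t) ** 2 for p, t in zip(predict(weights, rows), labels))

# Three different solutions, as two libraries (or two runs) might return.
candidates = [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
for w in candidates:
    print(w, predict(w, X), squared_loss(w, X, y))
```

So when comparing two libraries on such data, comparing prediction files is meaningful while comparing weight files is not.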

Below I share the python scripts used to uniquely parse out the features and reconstruct the raw prediction values:

  1. ## invert_feats.py
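    #!/usr/bin/env python
    import sys
    # expect input to be a vw data file on stdin; runs under Python 2.7 and 3
    def upd(d, k, v=1):
      # increment the count stored under key k by v
      if k in d:
        d[k] += v
      else:
        d[k] = v
    if __name__ == "__main__":
      feats = {}
      for line in sys.stdin:
        tmp = line.strip('\t\r\n ').split('|')[1]
        tmp2 = tmp.split(' ')
        # exclude the "w " namespace part, and skip empty tokens from repeated spaces
        for t in tmp2[1:]:
          if t:
            upd(feats, t.split(':')[0], 1)
      # one single-feature example per feature, tagged "<count>;<feature>"
      for k,v in feats.items():
        print('%d 1 %d;%s|w %s:1.0'%(v,v,k,k))
      # the empty feature: its prediction is the constant term alone
      print('0 1 0;|w')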
    
  2. ## apply_wts.py
    #!/usr/bin/env python
    import sys
    # apply weights to a vw example; runs under Python 2.7 and 3
    if __name__ == "__main__":
      wts_file = sys.argv[1]
      wts = {}
      # each line of the raw predictions file looks like "<pred> <count>;<feature>"
      with open(wts_file,'r') as f:
        for line in f:
          tmp = line.strip('\r\t\n ').split(' ')
          wts[tmp[1].split(';', 1)[1]] = float(tmp[0])
      # the empty feature holds the constant; subtract it to get the pure weights
      for k in wts:
        if k != '':
          wts[k] -= wts['']
      # do paste pred.sim out-00000-of-00001 | python apply_wts.py invert.sim | less
      for line in sys.stdin:
        tmp = line.strip('\r\t\n ').split('|')
        # drop the namespace token and any empty tokens
        tmp2 = [t for t in tmp[-1].split(' ')[1:] if t]
        res = wts['']
        for t in tmp2:
          s = t.split(':')
          # a feature written without an explicit value counts as 1
          res += wts[s[0]] * (float(s[1]) if len(s) > 1 else 1.0)
        tmp3 = tmp[0].split('\t')
        used = {k:wts[k] for k in [t.split(':')[0] for t in tmp2]}
        print(' ||| '.join(str(x) for x in [res, tmp3[0], tmp3[1], tmp[-1], used]))
  3. ## subtract_const.py
    #!/usr/bin/env python
    import sys,re,math
    if __name__ == '__main__':
      feat_pred_file = sys.argv[1]
      feat_wt_file = sys.argv[2]  # do not write to the same file in case of confusion
      feats = {}
      freqs = {}
      with open(feat_pred_file, 'r') as f:
        for line in f.readlines():
          tmp = line.strip('\r\t\n ').split(' ')
          wt = float(tmp[0])
          feat = tmp[1].split(';')[1]
          feats[feat] = wt
          freqs[feat] = int(tmp[1].split(';')[0])
        const = feats['']
        for k,v in feats.items():
          if k != '':
            feats[k] -= const
      with open(feat_wt_file, 'w') as f:
        txt = '\n'.join('%.10f %d;%s'%(v, freqs[k], k) for k,v in feats.items())
        f.write(txt)
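As a quicker cross-check that needs none of the files above, the audit lines quoted earlier can be verified directly. This is only a sketch: it assumes every audit token has exactly the four colon-separated fields seen in that output (name, hash, data value, model value), and the helper names `parse_audit_line` and `reconstruct` are mine, not part of VW.

```python
def parse_audit_line(line):
    """Split one audit line into (name, hash, value, weight) tuples.

    Assumes the four-field token layout seen in the audit output above.
    """
    feats = []
    for token in line.split():
        # rsplit from the right so a ':' inside a feature name is tolerated
        name, hash_, value, weight = token.rsplit(':', 3)
        feats.append((name, int(hash_), float(value), float(weight)))
    return feats

def reconstruct(line):
    """Raw linear prediction: sum of data value * model value."""
    return sum(value * weight for _, _, value, weight in parse_audit_line(line))

# (reported prediction, audit line) pairs copied from the output above
samples = [
    (0.068238, "Constant:3261788:1:0.130461 w^riversdale%20rd:78240:1:-0.0622233"),
    (0.146634, "Constant:3261788:1:0.130461 w^opera%20mini:2082032:1:0.0161729"),
    (0.154288, "Constant:3261788:1:0.130461 w^uoskirt:3187096:1:0.0238272"),
]
for reported, line in samples:
    rebuilt = reconstruct(line)
    # agreement is only up to the rounding of the printed numbers
    print('%.6f %.6f %s' % (reported, rebuilt, abs(reported - rebuilt) < 1e-5))
```

Each reported prediction is just the constant plus the single feature weight, which is exactly the relation that subtract_const.py inverts.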
    

print the first k elements of each group within a comma separated file

Suppose I am given the following csv file:
# input.csv
a,1
a,2
a,3
b,5
b,6
b,7

and I want to produce the first 2 elements of each group labeled by column 1.
# output.csv
a,1
a,2
b,5
b,6

This is not rocket science, but also not as straightforward as, say, sorting or uniquing in bash. In the special case where k = 1, one could use the following sort syntax:
sort -t, -k1,1 -u input.csv > output.csv
However, I had been under the impression that there is no straightforward way to go beyond k = 1. Today I found a way using awk (with k set as a shell variable beforehand, e.g. k=2):
awk -F, '!(a[$1]++ > ('$((k-1))'))' input.csv > output.csv

thanks to this article:
http://www.theunixschool.com/2012/06/awk-10-examples-to-group-data-in-csv-or.html
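For comparison, the same idea can be sketched in Python. This is an illustrative equivalent rather than anything from the article above, and the function name `first_k_per_group` and its `sep` parameter are my own. Like the awk version, it keeps a per-key counter and emits a line only while that counter is below k, so the input does not need to be sorted.

```python
from collections import defaultdict

def first_k_per_group(lines, k, sep=','):
    """Yield the first k lines seen for each value of the first column."""
    seen = defaultdict(int)  # how many lines we have met so far, per key
    for line in lines:
        key = line.split(sep, 1)[0]
        if seen[key] < k:
            yield line
        seen[key] += 1

rows = ['a,1', 'a,2', 'a,3', 'b,5', 'b,6', 'b,7']
print('\n'.join(first_k_per_group(rows, 2)))
```

To run it on a file, pass an open file object (with newlines stripped) instead of the `rows` list.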


To be desired in 2015

This morning I struggled with a logic puzzle as usual. Then I thought, well, maybe an easier task is to simply be an observer and list the desirable things in life. So here they are:

1. wireless chargers to clean up the mess under my feet: ever since I suggested the idea of transporting high voltage electricity through the air, I have been aware of how fallacious an initially attractive idea can be. That is not to say such things will never exist, but most likely not in my lifetime.

2. memory enhancing pill: I have heard a few elite professionals complaining or simply remarking on the finite capacity nature of their memory device. I guess it works like a queue.

3. practical toe-warming socks: the key here is practical. While it is not hard to come up with a battery powered gadget, most probably wouldn’t want to hook some wires near their extremities all day long.

4. nano-pill endoscopy: while I have heard rumors about such devices, I wonder why no doctor ever mentions them, if they are so convenient, painless, and even economical.

5. wearable glasses/lenses that enable easy typing: this would allow people like me who stay inside for too long to take a walk during the day and code at the same time. Siri is probably not a good substitute for typing, even though people with special needs have pioneered voice-driven programming. Eye tracking, on the other hand, may be too slow and painful. So instead, we should learn a separate keyboard system where all 2^10 - 1 combinations of the ten fingers can be put to work. This may have to wait for the next generation to get completely used to it. A compromise for the current QWERTY generation might be a touch-sensitive keyboard on the belly or thigh that gets reflected on the glass screen, so that typists can see the positioning of their fingers while they type. For someone like me who is not used to touch typing, this would feel very much at home.

6. programmable cooking robot: this doesn’t have to look like a robot, it just has to provide the damn functionality of cooking some of the most basic dishes. Right now the biggest dilemma regarding food is that you don’t want to cook, but going out to a restaurant feels like a business trip, or you might be concerned about the bad ingredients they put in the dishes, not to mention the gas cost.

7. near zero-cost food delivery system: this is a natural follow-up to item 6. Zero-cost in the eyes of the consumer, that is. Even in a plush region like the Bay Area, getting food ordered electronically and delivered to the mouth is far from a reality. Other parts of the world, like big cities in China, are much more used to this kind of service, but only because of the excess cheap labor supply they still enjoy.

8. Robot that can play with kids

9. relief from traffic congestion: unfortunately, even with the best road design and system engineering, the wave of incoming workforce presents an unpredictable challenge to the traffic problem in the Bay Area (yes, I live in my little bubble world). The cost of building new roads is considered prohibitive here in the US. Some remarkably trivial construction work took upward of 3 years to complete during my stay at Stanford, one of the richest universities in human history.

10. surveillance of suspicious or catastrophic activities: the idea of body cameras on police officers is truly more symbolic than anything. The fact that so many frames are wasted on mundane scenes should arouse the interest of compressed sensing / time series anomaly detection folks. But this one is more of an ontological challenge: how to record a scene without a recorder ready all the time?

Since I don’t have six fingers on either hand, I am going to stop here. Let’s hope the catharsis of unbridled creativity will flush out any toxicity from the brain and leave it poised for the next coding challenge.


Oh dear politics and how to survive in a non-Spartan environment

Today I seem to have grown too complacent after the coding work went smoothly. In the afternoon my self-importance was stoked further when I came up with a complicated diagram describing the work I am doing. It not only gives a visual overview, but also allows my manager and collaborators to monitor my progress very closely. Being a forgetful individual, I thought this was an ingenious mnemonic device. Indeed I often get so absorbed in pipeline building work these days that by the end of the day, or even by midday, I forget where I am coming from. Taking text notes doesn’t help much either, since parsing text is typically slow and too linear for certain tasks.

In any case, after sharing the diagram with another colleague, I brought it up again during a HipChat conversation, in addition to lots of technical talk about what I am doing and wish to accomplish. This seems to have been construed as an attempt to steal the thunder from the engineering colleague who is supposed to take over the work. What we had discovered in our earlier conversation was that it made much more sense for me to continue the major code change, a dilemma faced by any company that sees a lot of staff rotation or turnover. So after the dust settled, he asked that I spend no more time on updating the diagram and instead focus on answering some tickets that he had created. Fair enough; I said I would, but also highlighted the benefit of keeping the diagram for quick reference, and then hung up immediately, since I felt that staying any longer would expose my interpersonal weakness further. I don’t mind it being viewed as a silent rebuttal. It’s better than wasting time pursuing an argument that would aggravate our relationship further without achieving anything.

So, lesson learned? I don’t necessarily have to share progress reports with parties of conflicting interest. In this case, sharing with my manager may be a good idea, since at least it demonstrates some thoughtful effort. I can’t treat all collaborators like family members, since their personal objectives are typically misaligned with mine. I am glad I became sensitive to the gravity of the issue after seeing red-flag words like “waste time” or perhaps “spend too much time”. In any event it’s better to err on the side of caution. Tomorrow may bring some catastrophic outcome, but I am at least mentally prepared, so I will not suffer a sudden deflation of ego. These kinds of social nuances are the most stressful part of work for me. I am constantly torn between the need to be honest and the need to be not too honest, in case it infringes on other people’s territory. I am not yet sure if I really want to hone such skills. But if I am to stay in IT, I probably have no choice.


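Starting from what the C language and Li Bai have in common

Lately I have come to a deeper understanding of the C language. I realized that what I learned in school was all C++. It is nominally a superset of C, but it is so broad that one rarely ends up using the pure C features. So how does C have anything to do with the Poet Immortal? As everyone knows, Master Li was fond of hyperbole, for example: “the torrent plunges straight down three thousand feet”, “Tiantai rises forty-eight thousand zhang”, “clear wine in golden goblets at ten thousand a dou, rare delicacies on jade plates worth ten thousand coins”, “with you I will dissolve the sorrows of ten thousand ages”, “forty-eight thousand years have passed since then”, “one man slain every ten paces, never halting for a thousand li”, and so on. Evidently the great majority of these hyperbolic lines use large numbers figuratively, very much in the tradition of how the Shan Hai Jing (Classic of Mountains and Seas) describes distance, time, and height. C, being a low-level language, likewise often requires the programmer to allocate resources in advance, for example
int * array = malloc(1000 * sizeof(int));
char array[1000000];
and so on. The allocated size is often not exact but a rough estimate, or even an absolutely safe upper bound chosen out of laziness to compute the real one. With the memory of modern computing devices growing ever larger, to the point that a recruiter once said, with the nonchalance of the rich, “please do not use more than 4TB of memory”, allocating a seemingly wasteful amount of memory in a low-overhead language like C arguably saves development time, and is no less efficient than stingily using a high-level language. It is like grocery shopping: if you don’t haggle over the price of scallions and garlic, you can spend your useful brainpower on trading stocks and writing code, which is also wise. And the ease with which Li Bai and other Romantic forebears wielded figurative numbers further shows that people are more creative in an affluent, relaxed state of mind.
Also, in mathematical analysis, the proofs of many awesome theorems depend on great analysts like Jean Bourgain casually picking, at just the right moment, a few absurdly large constants as mental beacons, which lets the proof flow freely and effortlessly. At the same time it shuts out the beginners who can only verify line by line without grasping the spirit, telling them through proof by intimidation that they are still too green. Of course, the analogy between C and mathematical analysis is a bit of a stretch. Analysis is almost purely a product of second- and higher-order logic, whereas in algebraic combinatorics the statements themselves are mostly first order. So at this level analysis is more like Python or Lua. Interestingly, high-level programming languages strive to let users express the essential logic of mathematics or algorithms more efficiently, which in mathematics seems to be a thing of the past already. Today’s frontier mathematics, not just analysis, pursues objects and logic that computers cannot systematically summarize. So-called classical algebra is exactly the material whose proofs and statements are both fairly down to earth, that is, material that first-order logic can handle. Of course, not every mathematician can ride the wind-fire wheels of second- or even higher-order logic unimpeded, otherwise these people would have it too easy. Many in fact graft onto themselves the meridians that others have opened up, and then retrace them when needed. This cannot be called simple plagiarism, and it does a lot of good in popularizing certain exquisite constructions or ideas. A simple but not foolproof way to identify genuine second-order thinking is to check whether the proof contains irresponsibly large numbers; one could even use something like Zipf’s law to verify its authenticity further. Of course, this is also easy to use to mislead, or to abuse.
Another thing mathematics and C have in common is that once a proof or a piece of code becomes a classic, the lazy large numbers in it get trimmed by busybodies down to the standard of classical algebraic combinatorics. Of course, the trimming is not limited to a few constants; it may extend to the logic itself. Its importance in code goes without saying, and it usually happens soon after the first draft takes shape. Admittedly this was more commonplace in the early days of computing; nowadays rewriting code seems more in fashion. But a mathematical proof, once published, goes down in history forever, dross and gems alike. The same holds for other disciplines, which have even more dross. This unparalleled immortality deeply provoked some people in the coding trade, hence platforms like GitHub and Bitbucket. If only code were preserved forever it would be fine, but unfortunately abusers have started uploading data files. In the end, of course, the resources will be exhausted.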


math and buddhism

It turns out the central quote of my last diary entry also has an origin, previously unknown to me, in Buddhism, at least according to an app I recently downloaded that preaches daily Buddhist wisdom. Given my inability to assimilate into Judeo-Christian philosophy, which is rightfully parallel to Confucianism, I thought I’d give Buddhism a try. After all, it has a proven track record among Asians like me. Furthermore, modern Buddhism seems unbound from carnal prohibitions or technological distractions. But what really captivated me is the opacity and insinuation of its scripture. The translation from Sanskrit to archaic Chinese was done with poetic elegance. Compared with the King James Bible, I find the Buddhist texts in Chinese more monumental, true memorabilia. Perhaps I am just naive and only admire the ostentatiously graceful. Anyhow, I find it boring to spend hours immersed in uninspiring, condescending teaching and expect it to be useful at some point in the distant future. Maybe I will lavish such patience on research, but in religion I expect to find more immediate rewards, and not even the kind that caters to base desires, of which I have many.

So to connect back to math. I find myself cornered by my carnal feelings lately. While in the past I have tried the human-centric approach offered by Christianity, I found its strength quite fleeting and the progress unpredictable. This is mostly, again, due to the inner boredom I experience while trying to come to terms with the faith. Buddhism, on the other hand, emphasizes intelligent comprehension and internalization, or more melodramatically, inner echo. There is less of the universal acceptance that JC preaches, and more leveraging of individual talent. Thus not everyone is bound for the brightest next life. As a slight digression, I think the notion of a next life is more appealing than an eternal life in an unknown place called heaven that may not even have an iPad. But I am not even that long-sighted, so the more salient concern is intellectual challenge and satiability. Christianity seems to take the Machine Learning approach of addressing the most widespread issue within the target population. Thus it probably reaches the widest audience, rich or poor, regardless of race, education, and other divisive traits. It is true that Buddhism has reached up and down all classes of Chinese society from ancient times until now. But due to its lax codification and institutionalization, those lower on the social ladder are probably practicing a simplified variant of the original true teaching, and thus there is hope of aspiring to a purer form of the religion. Having a math background dictates that I am of the constantly aspiring type. I get upset when my close family members lose interest in fighting an uphill battle. On the other hand, I need a religious recourse, in addition to academic or social venues, to exercise my upward instinct. It is this constant fire of ambition that has troubled most of my adult life.
Of course, at various intervals I had physical downturns that prevented me from fulfilling my Oedipal curse, but I was certainly not content in those stagnant states.
Now I truly hope this newly acquired spiritual project will pull me away from unproductive thoughts and behavior, and rekindle my intellectual curiosity across the whole spectrum of things. After all, things like Fermat’s Last Theorem or the Poincaré conjecture have always been on my wish list to occupy my post-graduate life.
