Sanity check Vowpal Wabbit

One reason I have worked with linear regression for years without moving on to more glorious machine learning models like neural networks is that, even though the underlying idea is simple, it is virtually impossible to sanity check the correctness of a linear regression library with the naked eye.

Back at Yahoo Labs, I wrote down some steps for verifying that VW and the Weka wrapper of the liblinear library do in fact produce the same results. Being a good corporate citizen, however, I did not bring that knowledge with me when I left. I now care much less about liblinear, but VW is still a great tool to carry around: its sheer training speed on a single machine seems unmatched so far. So here I will focus on how to sanity check results from VW (v7.3), which should also help the user better understand its myriad flags.

  1. BFGS is the training mode of choice
  2. Issue the following command to train (sim is my own suffix, denoting simulated data):

    vw --bfgs --cache --cache_file=cache.sim -d out-00000-of-00001 --readable_model=readable.sim --passes=10 --termination=0.0000000001 --loss_function=squared --bit_precision=22 --final_regressor=model.sim

  3. Test on the original training data set. There are two prediction output flags, --predictions and --raw_predictions. The former seems to always truncate any final prediction > 1.0 to just 1.0.

    vw -d out-00000-of-00001 -i model.sim --raw_predictions=pred.sim --testonly

  4. Create a dummy data file with one feature per row, plus a row with the empty-string feature denoting the constant term (the first script below does this):

    cat out-00000-of-00001 | python > feats.sim

  5. Generate const + feature weight for each feature using feats.sim:

    vw -d feats.sim -i model.sim --raw_predictions=invert.sim --testonly

  6. Subtract const from the feature weights (the third script below):

    python invert.sim invert.vw

  7. Paste pred.sim and the raw data side by side, feed the result into a model-applier python script (the second script below), and eyeball the agreement between the vw (raw) predictions and the python ones.

    paste pred.sim out-00000-of-00001 | python invert.sim | less

  8. The way I (re)discovered the raw_predictions flag is through a useful feature called audit.

    vw -d feats.sim -i model.sim --testonly --audit | less

    It allows you to look at the data value and model value of each feature used in each input example:

    0.068238 20;riversdale%20rd
    Constant:3261788:1:0.130461 w^riversdale%20rd:78240:1:-0.0622233
    0.146634 36;opera%20mini
    Constant:3261788:1:0.130461 w^opera%20mini:2082032:1:0.0161729
    281.882654 562.090180 6 6.0 36.0000 0.1466 2
    0.154288 14;uoskirt
    Constant:3261788:1:0.130461 w^uoskirt:3187096:1:0.0238272
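
The audit lines above can be checked against the prediction lines directly. Here is a minimal sketch, assuming each feature entry has the form name:hash:value:weight and that the prediction is simply the sum of value times weight; `audit_prediction` is a hypothetical helper name of mine, not part of VW.

```python
def audit_prediction(audit_line):
    """Sum value * weight over the feature entries of one audit line."""
    total = 0.0
    for entry in audit_line.split():
        # Split from the right so a ':' inside a feature name cannot
        # shift the hash, value and weight fields.
        name, hash_index, value, weight = entry.rsplit(':', 3)
        total += float(value) * float(weight)
    return total

if __name__ == "__main__":
    # Audit line copied from the output above; VW reported 0.068238 for it.
    line = "Constant:3261788:1:0.130461 w^riversdale%20rd:78240:1:-0.0622233"
    print('%.6f' % audit_prediction(line))
```

For the three examples shown, the sums agree with the reported predictions to the printed precision, which is what one would hope for if the assumed format is right.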

As you can see, the constant term gets an example value of 1 in every example, and its model value is 0.130461. The “w^” token is the namespace of the features that I put in my training data. The big number after the name of the feature is its hash value. There are several sources of confusion with the above steps:

  1. The model file is specified with -f in training and -i in testing: -f stands for final_regressor and -i for initial_regressor.
  2. --readable_model is not that useful unless one can reproduce the hashing function used by VW. Let me know if you can implement it in python!!
  3. One shortcoming of bfgs mode is that --raw_predictions can be quite different from --predictions; the former can have a huge prediction loss since it does not clip predictions to [0,1]. For a sanity check against another LR library, sgd is the better mode to use. For an underdetermined system, don’t expect the feature weights to get even close between two LR libraries. But one should expect the prediction scores to be quite close when both have converged reasonably, since the least-squares prediction vector is the point closest to the label vector within the space of achievable predictions, and that point is unique.
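
To make the last point concrete, here is a toy illustration of an underdetermined system. The data and weights are made up by me and have nothing to do with VW output. The two feature columns are identical, so only the sum of the two weights matters, and very different weight vectors give identical predictions.

```python
# Toy design matrix with two identical feature columns (made-up data).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

def predict(weights, rows):
    """Linear model without a constant term: one dot product per row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

# Two very different weight vectors whose entries sum to the same value.
weights_a = [2.0, 0.0]
weights_b = [-3.0, 5.0]

pred_a = predict(weights_a, X)
pred_b = predict(weights_b, X)

if __name__ == "__main__":
    print(pred_a)
    print(pred_b)
```

Two libraries could land on either weight vector and both would be equally right, which is why comparing prediction scores is the more meaningful check.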

Below I share the python scripts used to parse out the unique features and reconstruct the raw prediction values:

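  1. Extract the unique features (step 4):
    #!/usr/bin/env python
    import sys
    # expect input to be a vw data file
    def upd(d, k, v=1):
      # add v to the count stored under key k, creating the key if needed
      if k in d:
        d[k] += v
      else:
        d[k] = v
    if __name__ == "__main__":
      feats = {}
      for line in sys.stdin:
        tmp = line.strip('\t\r\n ').split('|')[1]
        tmp2 = tmp.split(' ')
        # exclude the "w " namespace part
        for t in tmp2[1:]:
          upd(feats, t.split(':')[0], 1)
      # one single-feature example per feature, tagged "count;feature"
      for k, v in feats.items():
        print('%d 1 %d;%s|w %s:1.0' % (v, v, k, k))
      # empty example: its prediction is the constant term alone
      print('0 1 0;|w')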
  2. Apply the weights to each example (step 7):
    #!/usr/bin/env python
    import sys
    # apply weights to a vw example
    if __name__ == "__main__":
      wts_file = sys.argv[1]
      wts = {}
      with open(wts_file, 'r') as f:
        for line in f:
          tmp = line.strip('\r\t\n ').split(' ')
          # each line is "<const + weight> <count>;<feature>"
          wts[tmp[1].split(';')[1]] = float(tmp[0])
      # the empty feature holds the constant; subtract it from all the others
      const = wts['']
      for k in wts:
        if k != '':
          wts[k] -= const
      # do paste pred.sim out-00000-of-00001 | python invert.sim | less
      for line in sys.stdin:
        tmp = line.strip('\r\t\n ').split('|')
        tmp2 = tmp[-1].split(' ')
        res = wts['']
        for t in tmp2[1:]:
          s = t.split(':')
          res += wts[s[0]] * float(s[1])
        tmp3 = tmp[0].split('\t')
        # weights of the features used by this example, for eyeballing
        used = {k: wts[k] for k in [t.split(':')[0] for t in tmp2[1:]]}
        print('%s ||| %s ||| %s ||| %s ||| %s' % (res, tmp3[0], tmp3[1], tmp[-1], used))
  3. Subtract the constant from the per-feature predictions (step 6):
    #!/usr/bin/env python
    import sys
    if __name__ == '__main__':
      feat_pred_file = sys.argv[1]
      feat_wt_file = sys.argv[2]  # do not write to the same file in case of confusion
      feats = {}
      freqs = {}
      with open(feat_pred_file, 'r') as f:
        for line in f:
          tmp = line.strip('\r\t\n ').split(' ')
          wt = float(tmp[0])
          feat = tmp[1].split(';')[1]
          feats[feat] = wt
          freqs[feat] = int(tmp[1].split(';')[0])
      # the empty feature holds the constant; subtract it from all the others
      const = feats['']
      for k in feats:
        if k != '':
          feats[k] -= const
      with open(feat_wt_file, 'w') as f:
        # one "<weight> <count>;<feature>" line per feature
        txt = '\n'.join('%.10f %d;%s' % (v, freqs[k], k) for k, v in feats.items())
        f.write(txt + '\n')
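
Eyeballing in step 7 gets tedious on a large file, so the comparison could be summarized with a single number. The sketch below assumes the `|||`-separated output of the second script, with the python reconstruction in the first field and the vw raw prediction as the first token of the second field. `max_abs_diff` is a hypothetical helper of mine, and the sample lines are only illustrative, built from the audit numbers shown earlier.

```python
def max_abs_diff(lines):
    """Largest |python reconstruction - vw raw prediction| over all lines."""
    worst = 0.0
    for line in lines:
        fields = [field.strip() for field in line.split('|||')]
        reconstructed = float(fields[0])
        # The second field may hold "prediction tag"; keep the first token.
        vw_raw = float(fields[1].split()[0])
        worst = max(worst, abs(reconstructed - vw_raw))
    return worst

if __name__ == "__main__":
    # Illustrative lines in the assumed output format (not a real run).
    sample = [
        "0.0682377 ||| 0.068238 20;riversdale%20rd ||| 20 1 20;riversdale%20rd ||| w riversdale%20rd:1.0 ||| {}",
        "0.1466339 ||| 0.146634 36;opera%20mini ||| 36 1 36;opera%20mini ||| w opera%20mini:1.0 ||| {}",
    ]
    print(max_abs_diff(sample) < 1e-5)
```

In practice one would feed it `sys.stdin` in place of `sample`; a value near the rounding precision of the prediction file suggests the two sides agree.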

About aquazorcarson

math PhD at Stanford, studying probability