Why does the following 3Sum Closest solution not work?

class Solution {
public:
    void twoSumClosest(vector<int>& nums, int target, int end, int offset, int& s, int& d) {
        int i = 0;
        int j = end;
        while (i < j) {
            int ret = nums[i] + nums[j] + offset;
            int dist = abs(target - ret);
            if (ret == target) {
                s = ret;
                d = 0;
                return;
            }
            if (d == -1 || dist < d) {
                d = dist;
                s = ret;
            }            
            if (ret < target) {
                ++i;
            } else {
                --j;
            }
        }
    }

    int threeSumClosest(vector<int>& nums, int target, int left, int right) {
        int s;
        int d = -1;
        while (left <= right) {
            int mid = (left + right) / 2;
            twoSumClosest(nums, target, mid - 1, nums[mid], s, d);
            if (d == -1 || s < target) {
                left = mid + 1;
            } else if (s == target) {
                return target;
            } else {
                right = mid - 1;
            }
        }
        return s;
    }

    int threeSumClosest(vector<int>& nums, int target) {
        sort(nums.begin(), nums.end());        
        return threeSumClosest(nums, target, 0, nums.size() - 1);
    }
};

The idea above is to first sort the array, then do a binary search on one of the three numbers, and find the closest 2-sum among the elements before it. We keep track of the best sum s and its distance d from the target.

It turns out I cannot do even a single outer layer of binary search. The reason is that the condition

if (d == -1 || s < target) {
// or not

is not sufficient to decide between

left = mid + 1;
// or right = mid - 1;

If the realized sum s is < target, it is tempting to think the best first number should be somewhere to the right of the currently chosen one (mid), but it is quite possible that it lies to the left: the remaining two numbers might just need to be larger to compensate.
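
One concrete failure, from tracing the code above: with nums = [2, 3, 4, 12, 13, 14, 15] and target = 10, the true closest sum is 9 = 2 + 3 + 4, but the binary search only evaluates mid = 3, 1, 0 (never mid = 2, the index the optimal triple needs) and returns 17.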

So it seems a brute-force search over the first number is the only option. One interesting thought: since the above wrong solution passed 88 out of 99 unit tests, it would be interesting to come up with wrong solutions that pass all LeetCode tests, and perhaps, in the future, ones that also pass generative-AI logical reasoning. This could hopefully advance AGI.

class Solution {
public:
    void twoSumClosest(vector<int>& nums, int target, int start, int offset, int& s, int& d) {
        int i = start;
        int j = nums.size() - 1;
        while (i < j) {
            int ret = nums[i] + nums[j] + offset;
            int dist = abs(target - ret);
            if (ret == target) {
                s = ret;
                d = 0;
                return;
            }
            if (d == -1 || dist < d) {
                d = dist;
                s = ret;
            }            
            if (ret < target) {
                ++i;
            } else {
                --j;
            }
        }
    }

    int threeSumClosest(vector<int>& nums, int target) {
        sort(nums.begin(), nums.end());     
        int s = 0, d = -1;
        for (int i = 0; i < nums.size() - 2; ++i) {
            twoSumClosest(nums, target, i + 1, nums[i], s, d);
            if (d == 0) {
                return s;
            }
        }
        return s;
    }
};
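
(As a quick sanity check on the brute-force version: for the canonical LeetCode example nums = [-1, 2, 1, -4] with target = 1, it returns 2 = -1 + 2 + 1.)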

Consider joining xAI

I am a bit slow in following the news, despite devoting all my toothbrushing and driving time to YouTube channels. One of my biggest idols since I joined the industry, Christian Szegedy, recently moved from google to the company that seeks to understand the universe. While I enjoy reading Roger Penrose’s Road to Reality, a complete guide to the laws of the universe, I never imagined myself in a similar position.

After reading the job posting, however, I realized math figured prominently in their mission. That’s not surprising. Christian has been pursuing automated theorem proving (for lack of a more professional term that I can never remember), ever since his Batch Normalization and CNN fame. In science, accessible work tends to attract more citations. But it’s the neural theorem proving stuff that attracted my attention.

Now there are many concerns about whether I can pass the interview, let alone be competent in the position. I boasted of an ambition to contribute to the Lean theorem prover code base but never managed to. My academic track record is meager.

I do have some strengths, including fast and meticulous execution of machine learning pipelines. I have a knack for formulating math problems out of industry problems, and for combining branches of applied math in ways that impress reviewers. I can hardly think of much else these days. Maintaining both skill excellence and work/life balance with kids seems impossible. Taking vitamin pills may help, but it will take some time, and it needs to withstand the test of the upcoming winter chills.

But anyway, I have accepted the offer of a verbal exchange in the coming week. Success favors those who prepare, as evidenced by my recent launch review encounter. The best form of preparation may be writing blog posts, since I get disproportionately trained on input but produce hardly any output, especially outside the coding sphere.

One thing I may propose to Musk, if I get the chance, is to build a tool that facilitates the understanding of scientific literature, in the form of an IDE. I think that’s what Christian had in mind as well. He firmly believes in a rigorous framework for proof validation, before or one step ahead of the neural onslaught. If there is a viable place for me to contribute to that ambition, I am all for it, while my youthful idealism lasts.

Lastly, a shoutout to my wife for being super supportive of my financially downhill proposition. She comforted me by saying we spend way below our income level. But we all know it’s not just about the current spending level but about psychological security. It’s a sacrifice whose gravity she may not realize. But it’s a blessing to be surrounded by an idealist like her, no matter how uninformed that idealism might be.


Gumbel Softmax is basically TopK with some noise

I have been mystified by the notion of Gumbel Softmax. Even though the idea is very simple (we add Gumbel noise to make top-k equivalent to sampling without replacement), papers and blogs make it appear very theoretical, and it’s hard to connect it to an actual implementation.

So I decided to write this short note to help people who are similarly frustrated and in need of a quick refresher.

First, the major advertised appeal of Gumbel Softmax is that it allows back-propagation through a seemingly non-differentiable operation, namely sampling. But really there is nothing miraculous here. The top-k operation is similarly non-differentiable, yet we can do something called pass-through (straight-through) differentiation to make it so:

\frac{\partial}{\partial t} \max (\{x_1(t), \ldots, x_n(t)\}) \approx \frac{\partial}{\partial t} x_j(t) \mid_{j = \arg\max_i\{x_i\}}

Here the \max operator is the same as top-k for k = 1. The above equality is approximate because the \max operator is not differentiable when x_i = x_j for some i \neq j. However, this also happens for ReLU, and it occurs with probability zero at any given point in training, so we consider it a harmless approximation.
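
As a concrete illustration of the pass-through idea for k = 1, here is a PyTorch-style sketch (my own, not from any particular paper): the forward pass emits a hard one-hot argmax, while the gradient is that of the soft distribution.

import torch
import torch.nn.functional as F

def straight_through_argmax(logits):
    # Soft distribution, used only to supply a gradient.
    y_soft = F.softmax(logits, dim=-1)
    # Hard one-hot pick of the argmax (non-differentiable by itself).
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Forward value equals y_hard; backward gradient equals that of y_soft.
    return y_hard - y_soft.detach() + y_soft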

Next, Gumbel Softmax simply adds i.i.d. Gumbel noise to each x_i above. By doing that, you achieve exact sampling according to the softmax distribution over (x_1, \ldots, x_n). This is intuitively clear, since the higher x_i's are less likely to be overtaken by the noise than the lower ones, and hence remain highly likely to be chosen by the top-k operation.
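
Stated precisely (the standard Gumbel-max fact, in my notation): if g_1, \ldots, g_n are i.i.d. standard Gumbel variables, then

\Pr\left[\arg\max_i \,(x_i + g_i) = j\right] = \frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}},

so adding the noise and taking the top 1 samples exactly from the softmax of (x_1, \ldots, x_n), and taking the top k yields k samples without replacement from that distribution.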


How do we see that the point at infinity of a generic cubic curve is an inflection point, geometrically

In Silverman and Tate, section 1.4 on the group law, they claim that the point at infinity of a cubic curve in Weierstrass form,

y^2 = x^3 + ax^2 + bx + c,

is an inflection point, without much elaboration.

After consulting ChatGPT, I got the following argument: the point at infinity corresponds to the vertical asymptotic direction, where the curve has two branches, one pointing up and one pointing down, which we can think of as joined at a single point at infinity. The upper branch is clearly convex and the lower branch concave, hence the point that joins them is where the curvature changes sign, and it must be an inflection point.

Below is the ChatGPT transcript

Me: how to see geometrically that the point at infinity of a cubic curve is an inflection point?

ChatGPT: An inflection point on a curve is a point where the curvature changes sign, i.e. the curve changes from being concave to convex or vice versa. In the case of a cubic curve, the point at infinity can be thought of as an “ideal” point that the curve approaches but never reaches. By considering the tangent line to the curve at the point at infinity, we can see that the curvature changes sign as the curve approaches this ideal point. This is why the point at infinity is considered an inflection point of a cubic curve.

Another point of confusion for me was the claim that there is a triple root at X = 0 because of the equation X^3 = 0, even though it corresponds to the intersection of the lines x = \rm{constant} with the line at infinity. The reason for the apparent contradiction 0 = \rm{any constant} is that for points at infinity, we only consider the asymptotic directions, not the actual point. Because y grows faster than x as they approach infinity on the cubic curve, x can be treated as “stationary”.
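
To make the X^3 = 0 computation explicit (my rendering of the standard calculation, not the book's wording): homogenizing y^2 = x^3 + ax^2 + bx + c gives

Y^2 Z = X^3 + aX^2 Z + bXZ^2 + cZ^3,

and intersecting with the line at infinity Z = 0 leaves X^3 = 0. So the line at infinity meets the curve only at [0 : 1 : 0], with multiplicity three, which is exactly the condition for that point to be a flex with the line at infinity as its tangent.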

The non-singularity of the point at infinity is also easy to see geometrically: the curve has a well-defined tangent there, namely the line at infinity itself (the vertical lines x = \rm{constant} all pass through the point as well, but each meets the curve there only simply). To verify these claims algebraically, one can take the partial derivatives of the homogenized equation and then set Z = 0. The inflection point can also be verified by looking at the 3 \times 3 Hessian.

It’s a little tricky to see that the Hessian for a projective curve encodes the same information as the Hessian of its affine version. An even more basic question is why the Hessian of the defining equation has the same meaning as the Hessian of a graph (or parametrized curve)? One may need to invoke the implicit function theorem somehow.


farewell address

The past 9 months at XYZ have been among the most intense of my career so far. I have learned a tremendous amount in a very short period of time, thanks to the generous support of my manager, my onboarding buddy, and many other folks on the team. I am also thankful for the opportunity to work with other teams, as well as to host an intern.

Due to family reasons and personal connections, I have decided to move on to a different opportunity. I hope I have made some impact here at XYZ, and that I have shared something useful with the team. The mission at XYZ still makes perfect sense to me and remains a noble pursuit. However, I think there is a mismatch between my skill set and areas of interest on the one hand, and the type of work required on the other.

So I hope to stay in touch with many of you. Please keep me posted on the latest at XYZ. Given such a diverse pool of talents, I am confident you all can make it to the finishing line!


A great paper on internal workings of some deep learning models at google

In the CS publishing industry, people normally place a lot of emphasis on reproducibility of results. This usually entails open-sourcing your code, using public datasets, and disclosing all the steps in your experiment, including data preparation, hyper-parameter configurations, and the use of publicly available compute resources like GPUs.

What has transpired in the past decade or so in the deep learning world, however, is a trend towards satisfying those constraints and requirements at a superficial level, by cherry-picking results and baselines, without much regard to the actual practicality of the results published.

As a result, a counter-force has been brewing over the past few years, where high-profile companies like FAANG or MAANG are encouraged to submit their work to so-called industrial tracks, with the aim of disseminating industry best practices backed by actual launches, with whatever unfair compute advantage sits behind the work, as opposed to the proof-of-concept paper mills used by tenure committees. I had the pleasure of participating in one of those. Granted, such papers feel less self-contained and rely on your personal credibility to make the case; nonetheless, for someone not trained in the paper production industry, it was an invaluable way for me to share some insights quickly from the work I did.

Despite my own contribution, normally I don’t pay too much attention to these industry track papers because they usually just document some work history of highly paid ML employees in well established MAANG-like companies. But this week I ran into one on twitter that caught my attention: On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models. I am personally well acquainted with two of the authors, and they are both highly accomplished at google. I also know enough about their career trajectories at google to understand the background of techniques they describe, especially the first half of the paper, which contains a lot of all-too-familiar strategies we tried at google search.

I am honestly pretty surprised that google these days allows such explicit technical details to be disclosed to the public. I guess they have come to realize that it’s the sheer volume of data that matters, not the actual ML design choices, at least not as much at this point. Without further ado, I will highlight some important things I learned (or confirmed with my own shared experience) from that paper.

The entry point of the discussion is the tradeoff between model efficiency and accuracy. In slightly more corporate ML language, accuracy is also known as quality, at least in the context of search relevance. The paper outlines a few tricks to reach higher tradeoff points between these two objectives, such as a low-rank bottleneck layer, something I learned there but did not see publicized widely. It then spends one subsection discussing AutoML, which I thought was done really well.

Previously I felt unsure that AutoML would work at scale outside google, and I wasn’t completely sure how it worked inside either. A similar direction, called population based training (PBT for short), caught on for a while, especially at deepmind, then tapered off because it simply required too much compute power. I have a high school acquaintance who worked at Waymo and implemented the strategy successfully. But then she left and went back to google, or so I heard. The AutoML approach taken by this paper, however, felt very practical.

The basic idea is to have a pool of trainable weights, from which multiple different hyper-parameters being explored can share weights. Mathematically this does not make much sense, since an (i, j) matrix element of one model has nothing to do with the (k, \ell) element of another.

But that’s the beauty of deep learning, with which I am reasonably familiar at this point. As long as the overall capacity adds up, parameter sharing or other approximate schemes might just work, and best of all, the theory is beautiful!

Another important area for trimming excess model fat is data sampling, to which the paper also devotes a full subsection. Indeed, at google scale it is wise not to keep all samples, especially negative ones. This is also something I advocated at my current company: we should sample more potential positive examples for annotation, simply by keeping only the top k results according to the current search ranking system, for a very small k, say 2 or 3, instead of the 10 or 20 we currently use. Unfortunately the suggestion does not seem to resonate with the crew or management. The paper does mention that multiple sampling strategies can be applied in conjunction, e.g., one based on the display UI, and another on logits and labels.

All of the above can be considered part of the model efficiency/accuracy tradeoff. The next section deals exclusively with ways to improve accuracy, though efficiency obviously still needs to be considered. There, the paper outlines three major techniques.

First comes loss engineering. This encompasses techniques like rank loss and distillation.

For rank loss, the basic idea is to use pairwise examples in conjunction with pointwise ones. Either the two losses share the same predictor, in which case some per-example bias correction must be done (an idea I toyed with on multiple occasions but never saw implemented), or two separate predictors feed into the pointwise and pairwise losses. In the latter case the two predictors share as many parameters as possible, so that the pairwise predictor (presumably not used at inference) serves to regularize the pointwise predictor.
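
In my own notation (not the paper's), the shared-predictor variant might be written as

L = \sum_i \ell\big(\sigma(s_i + b_i), y_i\big) + \alpha \sum_{(i,j):\, y_i > y_j} \log\big(1 + e^{-(s_i - s_j)}\big),

where s_i is the shared score, b_i the per-example bias correction absorbing the calibration freedom left by the pairwise term, \ell a pointwise loss such as cross-entropy, and \alpha the weight on the rank loss.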

Distillation, especially of the teacher/student kind, was something I witnessed at google. It felt impractical given that training a single teacher was expensive enough. Fortunately, one can leverage the logged score of the previous model as a free teacher, which I believe is the idea proposed here. In that case we indeed only need to worry about the student training cost.

Lastly, curriculum learning is included under loss engineering as well, though it is really just a vast generalization of the learning rate schedule. For instance, with multi-task learning one could turn a given task on only after a certain number of steps. There is less mathematical elegance here, but my own experience also suggests it’s pretty useful. For instance, one could freeze embeddings after a certain number of steps to prevent the model from memorizing the training data.
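
A minimal sketch of the step-gating idea (my own illustration, not the paper's code):

def total_loss(step, main_loss, aux_loss, aux_start_step=10_000):
    # Turn the auxiliary task on only after a warm-up period.
    aux_weight = 1.0 if step >= aux_start_step else 0.0
    return main_loss + aux_weight * aux_loss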

A second pillar of accuracy improvement is the introduction of a second-order optimizer, pioneered by Rohan Anil in his earlier paper on the Shampoo optimizer. The basic idea, as I understood it, is that instead of computing the full Hessian matrix, we only compute restricted ones for each meaningful block of weights, for instance a kernel matrix or a bias vector. It’s not clear if they have tried LBFGS or other classical low-rank approximation techniques on top of such a restricted setup, but the overall efficiency/accuracy tradeoff reported is decent: +0.44% accuracy (compared to 0.1% for normal launches) vs +10% training cost. Still, I suspect this launch was a bit contentious and nerve-wracking for some.

Deep Cross (detailed in DCNv2) was the last area of improvement mentioned. There is very little mathematical elegance, and the gain looks smallish too. The paper was pretty honest about that, but does a good job summarizing the main idea within a single paragraph. It also mentions a bunch of localized tricks that seem needed for Deep Cross to work well, including AutoML and curriculum ramping.

The last three sections deal with more esoteric objectives, such as model reproducibility/robustness and calibration/credit attribution. Strangely, the paper splits calibration/attribution into two sections, one discussing the problem itself and the other the proposed solution. I spent the most time puzzling over sentences in these three sections, both because I work less in those areas and because the objectives are less well-defined, or need to be defined by the problem solvers themselves.

TBC


Conversation with the VP in the middle of a PTO

Even though I am on vacation this week, I had the humility to attend a weekly 1-1 with our VP to discuss an ongoing project. Last time, I brought up a lot of points of contention with him, and he responded with positive feedback. So I thought I should do even more of that this time, though not necessarily in a stronger tone.

I mentioned that I felt our general strategy of using a lot of human annotation as direct input to the system is not scalable, and is noisy. In a way I was questioning the existence of an entire team that interfaces between highly fine-grained, customer-specific annotation tooling and the internal workings of our ai system. I sensed that he understood my meaning very clearly this time, and nodded firmly. What surprised me was that he responded by saying he was surprised by my view. Furthermore, he called it criticism without a better solution, that is, if I criticize something, I should bring a better solution to the table. I did not mean it as a criticism of him personally, but I guess since he is the head of the organization, as well as its original designer, any negative feedback on the system can only be interpreted as personal criticism.

Then he even pushed me further, asking whether I was ready to resign. Fortunately the news of how Timnit Gebru was pressured into resigning was still fresh in my mind, so I gave a reluctant but firm no. At that point it should have been clear to him that I was faking it. But to avoid any embarrassing public drama, we both refrained from dwelling on the subject further. After that failed attempt at inducing a self-resignation, I also became instantly much more alert and careful with my words. I know this is no longer a guy I can trust with my true feelings, even at the product/technology level. I refrained from further criticism, and tried my best, with zoom facial expressions, to swallow some bitter Galilean pills (thank god I insisted on staying home for our PTO-hijacking meeting, for which he courteously apologized).

I guess the message is pretty clear. I am not welcome at this company any more, not only by my manager, but by his manager as well. The follow-up alignment 1-1 was really a test of continued employability, something I did not anticipate, due to lack of workplace maturity.

A few weeks ago, my manager insisted that I attend a 3-day offsite, despite my informal family excuse. At the time I felt it was my right to stand firm against such an unreasonable demand. When I submitted my PTO for this week, which included the offsite, my manager brought it up again and was still quite insistent, even offering to talk to HR about a last-minute trip arrangement. Now it has become clear to me that this is part of the startup cult that I must live with, or there will be no trust or bonding in future collaboration, no conception and preservation of a common baby, the birth of this lofty goal that everyone claims a part of.

To survive in a startup, one needs not just a personal connection with one's coworkers, but a religious trust and bonding with them. It is about selling a part of yourself to the cause, and pushing yourself into a corner with no possibility of backing off. I clearly lack this type of enterprising abandon. I am a PhD with a decent career path so far. There are many interesting research-y ideas I want to pursue in the name of product improvement in the coming years. Family is also a big part, if not the biggest part, of my life, especially in contrast to the sometimes minuscule impact I feel from the startup work I do.

As for me, there should be no ill feeling towards any of the leadership crew, or my coworkers. My tenure at any given workplace is roughly proportional to my level of accomplishment there, and that in turn is proportional to how passionate I feel about the technology behind the work, not necessarily the practical product impact it creates. I know many would argue this is at odds with a career in industry. To me, however, it is an extension of the academic life that I am a few generations too late to pursue. I don't consider myself excessively academic either. I don't write dozens of academically flavored research papers a year, nor do I try hard to sell my big ideas at major conferences.

I enjoy the purity of solutions, as the VP correctly pointed out. I don't mind spot fixes or patches as long as they occupy a well-defined, well-constrained place in the system, as long as there is a clear picture of the overall design, and as long as there is a clear sense that the wins from systematic, model-based approaches have been more or less saturated. Instead I was given no concrete direction, just a problem to solve along with approaches that I am not supposed to use, even though they may still have to accept those approaches if they turn out fruitful, and even though more junior members are given license to try them however they like. A very weird governing philosophy; as I told one recruiter, google search on steroids.


The DRY principle

I am a big believer in the DRY principle. But it is not a substitute for good design. I still believe a perfect design is amenable to a completely DRY implementation, though in reality one needs to deal with well-established languages and packages that do not necessarily work well together.

One instance of my attempt to enforce the DRY principle at my current company was refactoring their yaml configuration files through something called deep import, which I learned from another popular configuration language, hocon.

Here is how yaml supports the notion of inheritance. When you concatenate two yaml files, the later one overrides the earlier one, but only for the top-level keys. So essentially a dictionary update is performed, in python parlance. In practice, however, we work with not just a flat dictionary but a nested one, and then there is no built-in way to perform a nested/deep update, where you preserve all the information in the earlier configuration and only override the values at a nested key path.
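
A minimal sketch of the deep update itself (my own illustration with made-up keys; the actual import machinery at work is more involved):

import yaml

def deep_update(base: dict, override: dict) -> dict:
    # Recursively merge 'override' into 'base', preserving untouched nested keys.
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value  # leaves, lists, and type mismatches are replaced wholesale
    return base

parent = yaml.safe_load("model: {layers: 4, dropout: 0.1}\ntrainer: {steps: 1000}")
child = yaml.safe_load("model: {dropout: 0.3}")
print(deep_update(parent, child))
# {'model': {'layers': 4, 'dropout': 0.3}, 'trainer': {'steps': 1000}}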

I suppose this kind of feature is not taught in school to the fresh graduates at my current company: it's probably too trivial to be mentioned in any cs class. However, without the ability to override in a nested fashion (i.e., beyond top-level keys), there is no clean way to stay DRY while building an army of configuration files, which is what's being done at my work.

Now granted, yaml does offer another important built-in feature, namely anchors. Using anchors, one can give any block of configuration a name and reuse it later in the same file. This can be viewed as the global-variable approach of the programming world. But I guess there is a reason we have OOP and inheritance.

After some initial resistance and doubt, the team finally assimilated the idea and now it’s an integral part of the configuration ecosystem. The implementation is pretty straightforward: build a stack of imported yaml files, and apply deep update in a depth first manner. There are several catches though.

  • Sometimes we want to override a deep import locally: for instance, for a certain top-level key, we do want to override its value in its entirety. This can be called an atomic override. Indeed, that is the solution I came up with back at my previous company, already in the context of hocon: support an atomic-override special key.
  • Anchors and deep imports do not work well together, since anchors refer to an explicit block of configuration. With deep import however, it becomes ambiguous whether the block should include parent information. With shallow import this was never a problem.
  • Deep import followed by shallow import is a recipe for unexpected behavior. Shallow import is implemented simply through concatenation of yaml content, whereas deep import converts yaml content to a json dictionary first. Thus once we start using deep import, we are in json land and have said farewell to yaml land.
  • Lists (or repeated fields, in proto parlance) do not support nested update. This is in contrast to proto's merge_from utility, which treats list append as the analogue of dictionary update. What I did at my previous company was to implement a + operator for that situation, and later I also added support for a – operator. But in hindsight I think that was a bit of overkill.
  • The speed of parsing is now O(n^2) instead of O(n), where n is the number of files in the import stack. This is because shallow import is simply concatenation followed by libyaml parsing, which is all in c, whereas the stack approach needs to call libyaml n times. But in theory, if we wrote everything in c, the difference shouldn't be that big.

The yaml-specific catches above I learned at my current job, but the majority of the language-independent considerations had already matured before. One conclusion I reached at the end of my previous tenure was that it's probably a good idea to get rid of configuration languages entirely and use python scripts directly in their place: python is so flexible and bug-free (at least as a configuration language) that I wouldn't need to worry about cross-lingual compatibility, or about adding support by modifying the parser itself.

Now back to the general topic of DRY. I hear complaints on various forums that people often abuse it at the expense of good design. To re-iterate my earlier point, in a perfect world DRY is flawless, just as the universe is probably built from a few simple axiomatic principles, at least according to the theoretical physicists. So I almost never get mad at people who try to enforce DRY, maybe because the people I have worked with in the past were usually good enough with design strategies, or the bad ones hadn't even bothered looking into DRY yet. But I can understand how, in certain work environments, less capable people are somehow incentivized to enforce DRY ahead of other, more important aspects of engineering. I could be one of those people in certain people's eyes.

But recently I had an incident with my manager, who manifested the vice of DRY zealotry in one of the most unlikely situations. During our 1-1, I mentioned, perhaps for the 5th time, that I am working on a project that involves faiss, a popular tool open-sourced by facebook for efficient vector similarity retrieval. Really it's just a tool, and my goal is to do some clustering and some nearest-neighbor retrieval as a form of data mining/exploration. To me, the magic of ANN is the most nontrivial way I can think of to do NLP-related data mining.

From the start, my manager felt I should not invest my time in this area. And somehow, whenever he hears the word faiss, his immediate reaction is that I am wasting my salary on things that he doesn't think are important, or that someone else on the team is already working on. In his mind, there should be only one person on the entire team working on faiss; everyone else either shouldn't touch it, or should take the output of that one person as input. This is starting to sound like the DRY principle applied to engineering. He has been an engineer for most of his life, so I can sympathize with his motivation. He wants the whole team to be like an intricate piece of software, where every engineer plays the role of some library, callable by the others, without any redundancy, and completely under his master design and control.

So I was taken to another colleague, who had recently used the work done by yet another colleague who has spent much time understanding the faiss library. At that point I wished I had the courage/decisiveness to tell him that I am not re-investing all that time to learn the faiss package and compare its various settings. To me it's just a tool, similar to python, jupyter, or numpy. I don't think he would be averse to the idea of multiple people working with numpy directly, but somehow faiss is a mysterious thing, and it made him uncomfortable that more than one person was using it directly. In reality, I am using faiss only as an off-the-shelf batch matmul library, since I only use the brute-force mode of its retrieval functionality.
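
For the record, the brute-force usage I have in mind is roughly the following (a sketch with made-up array names, using faiss's exact IndexFlatL2, i.e. no approximation):

import numpy as np
import faiss

d = 128                                      # embedding dimension
corpus = np.random.rand(10_000, d).astype("float32")
queries = np.random.rand(5, d).astype("float32")

index = faiss.IndexFlatL2(d)                 # exact (brute-force) L2 search
index.add(corpus)
distances, ids = index.search(queries, 10)   # top-10 neighbors per query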

My manager's reaction, taking me to this other colleague and questioning my judgment for not leveraging existing team members' work, felt like gross mistrust in my ability and leadership. Perhaps he is so busy these days with all the firefighting at our small company that he forgets some basic courtesies. In a similar earlier storyline, I was almost forced to merge a deep learning pipeline I had built into an existing system of another team, though that was initiated not by my manager but by a member of the other team. I understand small companies want to be lean, but forced merging isn't necessarily the right way to reduce redundancy. People design things in very different ways, and a forcefully merged product may be twice as bad as the components separately. What I find works better is a shared library of common utilities that both sides polish and maintain.

I also don't mind merging my work with others', provided the two sides have very similar design taste. I am all for convenience and DRY, but the other party is all about bloated configuration/pipelines. Even if we merged successfully, my workflow would become 5x more time-consuming. There was simply no motivation for me to pursue that direction unless I was drunk (which I have never been).

But I don't blame the other party either. I know they are relatively fresh graduates, with lots of energy and no time commitments other than work. They also follow their managers' directions very closely, as they should in a startup. My general feeling about the managers is that they are very adept at avoiding the commonly documented pitfalls and follow the general principles of management very well. I honestly don't know how they could be better managers. What I do know is that their vision for how employees should fit into the grand design of the company spells poor wlb for the employees, even the smartest ones. To me that's not sustainable, not only for the employees but for the company as well.

To conclude, I am still a big fan of being DRY. I think it's a term that deserves to be more widely known; I just never anticipated people would apply it to people management as well. So while shocked and slightly depressed after such treatment, I am also amused by the nerdiness of my manager. At the end of our brief 1-1, he said that I should keep him in the loop. I guess that's an important signal I missed earlier. Such signals are difficult to convey in a neutral manner, so I applaud his courage in letting it out. I should do better at keeping him informed, though I had two concerns: (a) he usually appears very busy and I don't feel the need to add to his workload; (b) he is generally not encouraging about directions that do not align with his vision, yet at the same time he wants me to come up with plans to save a big part of the company. I think I did my best to stay productive while taking the appropriate risk of self-isolation.


My grievance towards my current job

The VP that I work with acts as the code review czar of the company. I didn't realize how demoralizing such a dictatorship can be until recently.

I have been trying to push a PR that implements a convenience function I use in jupyter notebooks. The whole point of convenience functions, in my opinion, is that they can be written relatively quickly and iterated on over time; that is, the goal is to achieve efficiency quickly, rather than through extensive polishing.

The VP, however, wants it to be of extremely high quality, in the event it gets used by others in production. Now I am not unused to such strict code reviewers, coming from google, so I offered to move the function to a non-critical file in the mono-repo. He didn't like the idea, presumably because he thought it would still be abused somehow.

Now I would call that highly misplaced code review priorities, let alone mistrust in senior engineers. There is far more low-quality code strewn throughout the code base than the piece I was contributing.

In the process of code review negotiation, which I have never been fond of, I offered, at his suggestion, to rename the function to something like unsafe_slow_xyz and to add plenty of warnings in the docstring to indicate non-prod use. I am certainly not well trained in writing bullet-proof production-level code, but I do understand when to exercise such caution and when not to. I found it unwise that a VP would spend so much time torturing me over a single function implementation. It breeds distrust and inefficiency, and represents a form of intellectual enslavement. In the end, I closed the PR as a soft slap in his face. I hope he has learned the subtle lesson. Maybe he is so much of a nerd that he doesn't care about others' feelings.

A few more things I can add about this: such strict code review takes away the fun of data science, IMO. All I am trying to do is fill a much-needed gap in the offline exploratory tooling. By enforcing strict code review standards, they make me much less inclined to do such altruistic things in the future. Maybe that's the reason I don't have a high rating on stackoverflow.

A second incident happened after I implemented an ad hoc wrapper around a popular deep learning framework. This solved the urgent problem that the existing wrapper was not able to train production models correctly. To appease the egos of implementers a decade younger than myself, I carefully named the project with the prefix “ad hoc” to signal that it was in no way a power-grabbing move.

I believe the project was well received, but then came the initiative to merge it into the existing framework, to become a part of the latter. I had no issue with that either; what I did wrong was to accept the task of doing the merging myself. The trouble is that I had no desire to understand the design choices of an in-house wrapper around an already very handy open-source ML framework. I'd much rather spend my time understanding the latter and doing the least necessary to adapt it for in-house use, without introducing any unnecessary boilerplate.

But being the nice person I always appear to be in public, I accepted the task. In the end I never found the time to complete it, and it left an impression of indecisiveness and empty promises with my colleagues.

Perhaps the single biggest source of frustration so far, however, is the use of dependency injection, in an attempt to discipline coders into the most schematic form of writing. Playing with such disciplinary tools has made me appreciate the value of self-discipline over externally enforced discipline; after all, an adversary can always get around the disciplinary boundary. Massive amounts of effort are currently being spent migrating code to be DI-conformant. Several criticisms of DI are in order.

First, the error logs are extremely opaque, since the stack trace needs to unwind all the nodes in the DI graph. Now our VP has promised to make the DI stack traces more transparent, but to me that seems to royally distract from the mission of the startup. It is far better suited to a well-established company trying to sell an IDE or related productivity software. To hijack the whole company's engineering culture with a personal pet project seems ridiculous.

Secondly, globals are generally frowned upon in any language, but in DI the entire graph, which encodes much of the actual logic, is one giant global variable. As one can easily see, this does not play nicely with IDEs.

Lastly, the implementation of DI in languages other than python is spotty at best, and much boilerplate code has to be added to support it. There is probably a reason my previous jobs never required me to write DI code, even though it has been around for ages. Initially I thought DI would make it much easier to write unit tests, for instance through free mocking. It turns out that in an ideal world the different providers are small enough units to be tested easily through UTs, but in reality they have so many upstream dependencies that they are just as hard to test as ordinary functions or classes. So their value has become much less clear to me.

From now until I quit my current job, I will maintain my opinion about certain tradeoffs between convenience and formality. I simply do not believe the various coding standards being enforced have any scientific justification, to quote a colleague who abuses that phrase. They are more of a strong personal opinion that must be followed to establish an intellectual pecking order.

I will probably also try not to be surrounded by young, testosterone-driven colleagues in the future: it distracts me from the work that really matters to me personally. I am surprised that people are willing to give up money to be surrounded by such people. Such people are indeed pretty rare, and I simply made an uninformed bet in becoming one of them myself. This may be the first job I leave without reaping the full first-year reward, and I may even need to pay some of it back. But the very reason I left my previous job was to regain the freedom of career change, so I won't be deterred by such minor financial details.

In the end, work should be fun, not only a so-called learning experience. Of course I am speaking from a highly privileged position.


Recognizably depressed at work

The amount of talent at my current company seems to be overflowing. This is what I intentionally sought when I decided to leave my previous job. It wasn't that I was bored or felt my job was too easy; I wanted to have intellectually more stimulating conversations.

Also, throughout my career I have always played the role of nuts and bolts, rather than feeling in control of the product itself. When I first started my career, I vowed never to join a startup. That has changed after roughly 10 years. I guess that's natural.

In my current job, I desperately want to simplify things. I don't understand how a product with so many moving pieces can possibly be maintained and reach profitability in the long run. I understand the people around me are generally smarter than me. Indeed they seem to remember things better, and always know when to complain about things that are not their own fault. I also meet people who are highly detail-oriented and work with sql queries like bread and butter.

I am impatient when it comes to running sql queries; I seek to run everything in distributed/multi-processing mode. Almost nobody seems to share the same level of impatience. My entire mindset is about how to make things faster. I guess if you don't have a product yet, there is nothing to optimize. But the 0-to-1 transition seems to be what I was hired for, and I feel highly under-qualified for such a role. My manager doesn't seem to understand my strengths and weaknesses, but I am sure he eventually will.

In terms of the ranking job I am doing, I essentially want to build a single neural model and declare victory. I know the traditional signals are important for cold start, but I don't see them adding any value past the cold-start phase. My previous job reinforced this view: nobody there cared about classical techniques like bm25. I am mildly surprised that there are much younger folks at my current company who disdain AI and want to go back to old-school information retrieval. I guess they are energetic and precocious, and I haven't really advanced much beyond the basic supervised-learning mentality.

I just learned on fb that a remotely acquainted Korean ex-coworker seems to suffer from depression. It appears Korean Americans have a decent depression rate, based on 2 sample points. I was told that I look Korean myself, so maybe there is some risk there.

Overall, perhaps I need to let life control me more, and completely give up struggling. I often blame myself for not anticipating difficulties at certain stages of life, or simply not having enough faith in their reality. For instance, older people can suffer from pancreatitis, and genetically I know I am highly likely to as well, given my maternal grandpa's history and my own digestive woes. Yet I still don't seem to pay enough attention to this possibility. Hopefully writing things down adds more weight to such thoughts.

Overall, I just don't understand why people write so many complicated pieces of code, and what's more, how they can stomach that. It seems different people's brains have evolved differently.

What feels like low-hanging fruit to me seems laughable to others. I am surprised that after 10 years of industry experience I still haven't reached the level of common sense. I guess one needs to actively seek out opportunities for painful growth; otherwise one is just living in a comfort bubble, although it was never comfortable for me except on rare occasions.

One who remains calm in the face of comfort may be better suited for big corporate life, and ultimately wins the race of life. I can't say I don't yearn to go back to the big corporations, which at least provide some near-term illusion of stability. The pain I acquired from my past experience at big companies is gradually fading. I only remember it as an abstract concept nowadays, something I might have told my future self at the time, in anticipation of my forgetful personality.



















